Migrations: migrating attributes, pt. I
In this chapter of the “Migrations” series we discuss what is probably the most complex operation in this space: migrating an attribute to a different physical representation. See the previous issues for the discussion of “Adding” and “Getting rid of” attributes, links, and anchors.
For simplicity and clarity, we’re going to discuss the migration of only a single attribute. We’ll present the most comprehensive multi-step procedure, even though in practice some of the steps may be shortened or skipped. Double-checking the data at almost every point minimizes the possibility of data loss or corruption during the migration.
There are maybe half a dozen common ways to store attributes on the physical level:
A table column (in the main table, a side table, or a per-attribute table);
Entity-attribute-value approach (EAV);
A key in a JSON-typed column;
A file in a filesystem or in S3-style storage (especially for binary objects such as image files);
Using out-of-band values of the data type, for example encoding a boolean value in the sign of an integer number (see the short sketch after this list);
Etc., etc. There are many other creative attempts to rethink what the database provides.
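Here is the short sketch promised above: a boolean flag encoded in the sign of an integer. The names are invented for illustration, and the corner case in the comment hints at why such representations end up on the migration list:

```python
def encode(amount: int, is_refund: bool) -> int:
    # Negative values mean "refund"; this breaks down for amount == 0,
    # which is one reason such encodings eventually get migrated away.
    return -amount if is_refund else amount

def decode(stored: int) -> tuple[int, bool]:
    return abs(stored), stored < 0

assert decode(encode(100, True)) == (100, True)
assert decode(encode(100, False)) == (100, False)
```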
We may want to change the physical representation of an attribute for many reasons:
Performance improvements from switching to a better representation;
Moving sensitive data to a better-protected data storage for compliance reasons;
Moving from a deprecated physical representation to one mandated by the company’s engineering guidelines (e.g., making the schema more strict);
Moving data away from a storage that is running out of disk space.
Overview of the migration procedure
We’re going to take the following steps:
Prepare the new physical storage (e.g., create a table column);
Implement a cross-checking tool that compares the old and new storage and reports discrepancies;
Make the application code that sets attribute values double-write to both the old and the new storage (see the sketch after this list);
Implement a tool for bulk incremental sync of attribute values from the old storage to the new one;
Make the application code that reads attribute values read from both the old and the new storage, in two substeps:
first compare the results and keep using the values from the old storage;
then compare the results and switch to the values from the new storage;
Change the application code to no longer read from the old storage;
Prepare the old storage for writes to stop;
Clean up both the cross-checking tool and the sync tool;
Change the application code to stop writing to the old storage [this is the point of no return];
Get rid of the old storage.
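To make steps 3 and 5 concrete, here is a minimal sketch of what the migration-aware read/write path could look like. It assumes two storage backends exposing get/set methods; all the names are invented for illustration, not taken from any particular framework:

```python
import logging

log = logging.getLogger("attribute_migration")

class MigratingAttribute:
    """Wraps reads and writes of one attribute during the migration."""

    def __init__(self, old_store, new_store, read_from_new=False):
        self.old = old_store
        self.new = new_store
        # Flipped between substeps 5a and 5b:
        self.read_from_new = read_from_new

    def set(self, entity_id, value):
        # Step 3: double-write to both the old and the new storage.
        self.old.set(entity_id, value)
        self.new.set(entity_id, value)

    def get(self, entity_id):
        # Step 5: read both copies, report discrepancies, return one of them.
        old_value = self.old.get(entity_id)
        new_value = self.new.get(entity_id)
        if old_value != new_value:
            log.warning("mismatch for %s: old=%r new=%r",
                        entity_id, old_value, new_value)
        return new_value if self.read_from_new else old_value
```

Flipping read_from_new is the transition between substeps 5a and 5b; keeping it a runtime flag rather than a code change makes rollback trivial.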
In practice this procedure can take many days, or even weeks or months in complicated cases. The time needed depends on:
Amount of data being migrated;
Deployment velocity;
The velocity of code changes, particularly how quickly you can find all the places where the attribute values are read and written;
System observability: the maturity of monitoring and error reporting.
Before and after each step we’re going to monitor certain invariants and changes in system behaviour. Some steps will be iterative: if we detect that something did not go as expected, we go back and fix the areas that were overlooked.
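As an illustration, the cross-checking tool of step 2 and the bulk sync of step 4 can share one scanning loop. This is a hedged sketch under the same assumed get/set storage interface as above:

```python
def cross_check(ids, old_store, new_store, sync=False):
    """Compare attribute values in both storages for a batch of entity ids.

    Returns the number of mismatches. With sync=True, also copies values
    from the old storage (the source of truth before the point of no
    return) into the new one.
    """
    mismatches = 0
    for entity_id in ids:
        old_value = old_store.get(entity_id)
        new_value = new_store.get(entity_id)
        if old_value != new_value:
            mismatches += 1
            if sync:
                new_store.set(entity_id, old_value)
    return mismatches
```

Run in batches over the whole id range, with sync=True this doubles as the incremental sync of step 4; with sync=False it is the invariant monitor we rerun before and after each step.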
Step 1: Preparing the new physical storage
So, you’ve played another round of Game of Tables and you have a certain idea about where you’re going to move the data. It may be a column in an existing table, or a brand new table. It may also be a schema-less solution, such as an Entity-Attribute-Value scheme or a JSON-typed field.
If you use a column-based approach, you will first have to create the column (or the table). It is going to be completely empty at first, without any data for our attribute, of course. Adding a column to an existing table may be a long operation, and even this step requires extra care. For example, you may need a tool like “online schema change” that rewrites the existing table in a controlled manner. Without it, there is a chance that the table gets locked by the “ALTER TABLE” statement, potentially for a very long time, disrupting system operations.
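For a toy illustration of this step, here is the column creation in SQLite, where ALTER TABLE is cheap; on a large MySQL or PostgreSQL table you would typically run the same change through an online-schema-change tool (e.g., Percona’s pt-online-schema-change or gh-ost). Table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
conn.execute("INSERT INTO users (plan) VALUES ('free')")

# Step 1: add the new, initially empty column for the migrated attribute.
conn.execute("ALTER TABLE users ADD COLUMN plan_v2 TEXT")

# Every existing row has NULL in the new column, as expected.
assert conn.execute("SELECT plan, plan_v2 FROM users").fetchone() == ("free", None)
```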
Creating a new table, on the other hand, is a trivial operation in virtually all databases. Schema-less solutions also require only minimal preparation to store the new attribute: maybe you’d need to register the attribute name in some kind of allow-list.
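For comparison, the whole “step 1” for a schema-less representation might amount to a single registry entry. A hypothetical sketch of such an allow-list:

```python
# A registry of known attribute names for an EAV or JSON-column
# representation; all names here are invented for illustration.
ALLOWED_ATTRIBUTES = {"email", "plan"}

def validate_attribute(name: str) -> None:
    # Reject unknown attribute names before they reach storage.
    if name not in ALLOWED_ATTRIBUTES:
        raise ValueError(f"unknown attribute: {name}")

# The entire "step 1" for this representation:
ALLOWED_ATTRIBUTES.add("plan_v2")
validate_attribute("plan_v2")
```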
This juxtaposition of the relative complexity of introducing new attribute storage does not imply that some storage approaches are necessarily “better” than others. The point of this section is that you (and others who work with your system) must understand exactly what happens at every step of the way, which tools must be used, and what the consequences of the chosen representation are.
Normally, adding a new empty column should not substantially change the performance of the current system. However, it may! For example, the data density of the table space may decrease because of the new field, so to retrieve the same amount of data the server has to read more disk blocks. Another possible scenario: after the table is altered, the data distribution statistics get recalculated, and the query planner may decide to use a different query plan based on the updated statistics. This new plan may or may not be better than the previous one. Such cases are rare, but there are real-world stories of them happening, so you should be prepared, even if only theoretically.
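One cheap precaution is to capture the plans of your hot queries before and after the change and compare them. A sketch using SQLite’s EXPLAIN QUERY PLAN (other databases have their own EXPLAIN variants; names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
conn.execute("CREATE INDEX idx_users_plan ON users (plan)")

def plan_of(sql):
    # EXPLAIN QUERY PLAN is SQLite-specific syntax.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

query = "SELECT id FROM users WHERE plan = 'free'"
before = plan_of(query)
conn.execute("ALTER TABLE users ADD COLUMN plan_v2 TEXT")
conn.execute("ANALYZE")  # refresh statistics, as the server might
after = plan_of(query)
if before != after:
    print("query plan changed:", before, "->", after)
```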
To be continued…
This issue is almost 900 words already, so it’s time to wrap up. We’ll continue discussing the following steps in the next issue.