Migrations: migrating attributes, pt. III
Welcome to Part III of the “Migrating attributes” series. In the first two parts we gave a brief introduction, established the 10-step migration procedure (reproduced below for convenience), and discussed the first three steps. Today we’re going to discuss step 4.
1. Prepare the new physical storage (e.g. create a table column);
2. Implement a cross-checking tool that compares old and new storage;
3. Make the application code that sets attribute values double-write to both old and new storage;
4. Implement the tool for bulk incremental sync of attribute values from old to new storage;
5. Make the application code that reads attribute values read from both old and new storage, with the following substeps:
   a. compare the results and use values from the old storage;
   b. compare the results and use values from the new storage;
6. Make the application code no longer read the old storage;
7. Prepare the old storage for stopping writes to it;
8. Make the application code stop writing to the old storage [this is the point of no return];
9. Clean up a) the cross-checking tool; b) the sync tool;
10. Get rid of the old storage.
Step 4: Implement the tool for bulk incremental sync of attribute values from old to new storage
After the first three steps we’ve prepared the new storage for the existing attribute, and there is already a bit of data in this storage. We also have a cross-checking tool that reports how much data still needs to be copied from the old to the new storage. We can now begin implementing the tool that synchronizes data between old and new storage.
Conceptually the algorithm is pretty simple:
for attribute values that are set in the old storage, and not set in the new storage: copy the values to the new storage;
for attribute values that are not set in the old storage but set in the new storage: unset the values in the new storage;
for attribute values that are set in both old and new storage, but to a different value: copy the value from old to new.
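Expressed as code, a single sync pass could look like the following minimal sketch. The read_old/read_new/write_new/unset_new callables and the MISSING sentinel are hypothetical storage adapters, not part of any particular library:

```python
MISSING = object()  # marks "attribute not set in this storage"

def sync_attribute(entity_ids, read_old, read_new, write_new, unset_new):
    for entity_id in entity_ids:
        old = read_old(entity_id)
        new = read_new(entity_id)
        if old is MISSING and new is MISSING:
            continue                   # not set anywhere: nothing to do
        if old is MISSING:
            unset_new(entity_id)       # set only in the new storage: unset it
        elif old != new:               # covers "missing in new" and "differs"
            write_new(entity_id, old)  # old storage is the source of truth
        # old == new: skip, so re-running the tool generates no writes
```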
In practice there are several additional important requirements. First, the synchronization tool must generate a minimal amount of write traffic: if an attribute value is the same in both old and new storage, that row must be skipped. In other words, after a complete run of the tool, if there were no further writes to the old storage, re-running the tool should generate no write operations at all. This is important because we want to control the level of stress on the system during the migration.
Second, the writes should be batched if possible. This is general advice for improving performance, and it applies very often in a typical database environment. Processing rows one by one may put unnecessary load on the servers and greatly increase processing time. This is important because we want to reduce the turnaround time of the typical multi-attempt migration process, and to save developers’ time and spare them stress.
There are some environments where batching is not possible, for example migrating image files from the file system to something like S3.
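Where the storage does support multi-row writes, the copy phase can accumulate pending rows and flush them in groups. A minimal sketch, assuming a hypothetical write_new_batch adapter that issues one multi-row statement (e.g. an INSERT ... ON CONFLICT DO UPDATE) per call instead of one write per row:

```python
BATCH_SIZE = 500  # tuning knob; the right size depends on row width and load

def copy_in_batches(rows_to_copy, write_new_batch):
    """Accumulate pending (entity_id, value) pairs and flush them in groups."""
    batch = []
    for entity_id, value in rows_to_copy:
        batch.append((entity_id, value))
        if len(batch) >= BATCH_SIZE:
            write_new_batch(batch)
            batch = []
    if batch:
        write_new_batch(batch)  # flush the remaining partial batch
```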
Third, the writes should be throttled to prevent replication issues. If you need to migrate a few dozen million attribute values and you push all of that data as fast as possible, you risk overloading the replica servers (and the primary server, too). So you need to track replication lag and pause to let the replicas catch up. This topic has been discussed previously on this substack: “How to delete a lot of data, part I” and “Part II”.
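One simple way to throttle is to check replication lag between batches and pause until it drops below a threshold. A sketch, assuming a hypothetical replication_lag_seconds callable that reports the current lag (e.g. from pg_stat_replication or a monitoring endpoint):

```python
import time

MAX_LAG_SECONDS = 5.0  # pause whenever replicas fall further behind than this
CHECK_INTERVAL = 1.0   # how long to sleep before re-checking the lag

def wait_for_replication(replication_lag_seconds):
    """Block until replica lag drops below the threshold."""
    while replication_lag_seconds() > MAX_LAG_SECONDS:
        time.sleep(CHECK_INTERVAL)

# Typical use between batches:
#   write_new_batch(batch)
#   wait_for_replication(get_current_lag)
```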
Lowering the barrier
Fourth, one important thing that your engineering organization should strive for is making migrations as accessible to everyone as possible. The process and technology of migrating data must be well known, well established and available to every team member. This allows you to be more agile in handling your data: experimenting with different representations, fixing past mistakes and paying down technical debt.
If you look at the proposed 10-step sequence, you can see that some of the steps could be skipped, short-circuited or combined with other steps. This increases the risk of introducing errors and outages, but if the data is not too important, that may be a price you are willing to pay in the name of development speed.
This is all true, but migrating low-importance data is also an opportunity to become familiar with the migration process, to test it, and to improve support for data migrations in your company’s ecosystem. People who have been through a migration become more conscious of data management in your codebase, because they have been exposed to more phases of the data lifecycle.
The barrier to migrating high-importance data will also be lowered as more people become acquainted with changing the codebase to support data migration. This may allow you to migrate even the most crucial data, which traditionally has a lot of old-growth organic code around it and resists any change.
If we agree on this, the first thing that comes to mind is to create a sort of data migration toolkit (or even framework) that would handle the common tasks and allow the developer to provide only the parts that change. This may be a good idea, but the toolkit must not be too strict. The risk is that such a framework will quickly and easily handle the common case (for example, migrating a column from one table to another in the same relational database), but will stop helping as soon as the migration becomes more unusual (for example, migrating a JSON-encoded attribute to a key-value representation, or migrating from Postgres to Cassandra).
This strictness would bring us back to square one in terms of process complexity. If you have a very comfortable tool that pushes you toward a specific migration scenario, you will be “punished” for trying something that has not been tried before: a new technology, or a new data representation. Basically, a bit of a “golden cage” situation. Don’t overfit on convenience, and always be ready to handle the problems at the tail end.
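One way to keep such a toolkit from turning into a golden cage is to limit its interface to a few storage-agnostic hooks, with the framework owning only the generic machinery (batching, throttling, progress reporting). A sketch of what that could look like; the class and method names here are hypothetical:

```python
from abc import ABC, abstractmethod

class AttributeMigration(ABC):
    """One possible toolkit interface: the framework drives the sync loop;
    a concrete migration supplies only the storage-specific hooks below."""

    @abstractmethod
    def iter_entity_ids(self):
        """Yield entity ids that still need to be synced."""

    @abstractmethod
    def read_old(self, entity_id):
        """Return the value from the old storage, or a MISSING sentinel."""

    @abstractmethod
    def read_new(self, entity_id):
        """Return the value from the new storage, or a MISSING sentinel."""

    @abstractmethod
    def write_new_batch(self, rows):
        """Apply a batch of (entity_id, value) pairs to the new storage."""
```

The narrower the set of required hooks, the easier it is to plug in an unusual source or target without fighting the framework.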
To be continued…
In the next post we’ll discuss step 5: switching the codebase from reading the old storage to reading the new one.