Migrations: migrating attributes, pt. II
What is this substack about? Here are the highlights from the first 25 issues.
Welcome to the Part II of the “Migrating attributes” series. In the first part we made a brief introduction and established a 10-step procedure of the migration (reproduced below for convenience). Today we’re going to discuss steps 2 and 3.
Migration procedure:
Prepare the new physical storage (e.g. create a table column);
Implement cross-checking tool that compares old and new storage;
Make application code that sets attribute values to double-write to both old and new storage;
Implement the tool for bulk incremental sync of attribute values from old to new storage;
Make application code that reads attribute values to read from both old and new storage, with the following substeps:
compare the results and use values from the old storage;
compare the results and use values from the new storage;
Make application code to no longer read the old storage;
Prepare old storage for stopping writing to it;
Make application code to stop writing to the old storage [this is the point of no return];
Clean up a) cross-checking tool; b) sync tool;
Get rid of the old storage;
Step 2: Implement cross-checking tool that compares old and new storage
So, on the previous step we created the new physical storage for data, but no data has been migrated yet. What we can do now is implement a tool that cross-checks both old and new storage, and report if there are any discrepancies.
We’re going to be using this tool continuously almost until we reach step 8. Step 8 is the point of no return, so we only take it if the tool reliably reports no discrepancies.
We want the tool to be fast (relative to the amount of data), so that the turn-around time is minimal. Thus we would almost certainly want to use batch operations, if the storage supports it. Some storage implementations, such as file systems, do not have batch operations.
For very complicated cases you may want to implement some automatic parallelizing, but try to have this tool as simple and straightforward as possible. If the cross-checking takes much more than a few hours, it may make sense to implement a restart after the tool is terminated.
It’s unnecessary to mention that the tool needs to be correct. You may want to implement it in a very defensive way to handle various data modeling quirks such as NULLs, unexpected values, etc. If the tool is incorrect and you proceed to step 8, the data loss is going to be introduced.
Generally we distinguish between set attribute and unset attribute. For some data types it’s not possible to distinguish between an unset attribute and a set attribute with a certain value. For example, in commonly used boolean data encoding, the false value is the default (unset) value, and so you can’t distinguish between the two. But for many important data types, such as strings and blobs, it’s perfectly possible. The tool needs to be able to handle that.
Initially, after step 1 the tool is supposed to report up to 100% of discrepancies, because the data has not been migrated. The recommended output from the tool should include the following:
Number of processed records (rows, or IDs);
Number of attribute values that are set in the old storage and not set in the new storage;
Number of attribute values that are not set in the old storage and set in new storage;
Number of attribute values that are set in both storages, but have different values;
Processing time and speed, in human-readable form: how much time a single run of the tool takes, and how many rows are processed per second. This is useful, because sometimes you may want to improve the development experience by optimizing. Also, this information allows you to get a rough, unscientific but useful idea of the performance of your storage;
You want to track the migration progress, so in addition to the number of discrepancies, you need to show human-readable percentages. We’d recommend using two decimal places: it’s a good balance between precision and human readability. You want it to be careful when handling percentages very close to 0% and 100%. For example, if there are 5 discrepancies out of 10 million, you do not want to show “0.00%” due to rounding, but you may artificially show “0.01%”. Similarly, if there are 9.999.995 correct rows out of 10 million, it’s not “100.00%”, but “99.99%”.
We pay a lot of attention to this kind of micro-copy because we want to reduce the migration stress as much as possible. You want to be able to quickly and reliably understand the current situation with a single glance at the output of your tool.
Non-transactional discrepancies
If old and new storage are in different databases, or even in different database systems, there is a possibility of spurious discrepancies. Suppose that we store photos in the filesystem, and now we want to migrate to S3. When we enable double-writing to both storages, it’s possible that the file is written to the filesystem, but is in the process of uploading to S3. If your tool would try to compare the data for this photo, it will report discrepancy. This discrepancy will go away in a few seconds, but you will waste time re-checking, and you will be stressed a little bit: was it a bug?
You may want to specifically handle this situation by either ignoring very recent data (if that is possible), or retry comparison of only the failed items after the main run has been completed. Again, this careful handling of “edge cases” is needed to prevent your stress: data migrations are often quite demanding.
Step 3. Double-write attribute values to both old and new storage
Now we want to begin writing new attribute values (for added and changed values) to both old and new storage.
This is where maintainability of your code base is going to be crucial. You need to find all the places where the attribute value is written and introduce a second write. Also, you mark all those places because later on (on step 8) you will have to do another pass and remove the code that writes to the old storage.
If your database access layer is well written you may find that you need to change only one function. But of course, for mature organically grown codebases even the process of finding out every single place where the attribute value is written to may take time.
As you change the attribute setting code, test it, deploy and run in production for a while, you will see that the data now starts appearing in the new storage. This is when you run the cross-checking tool again and confirm that the percentage of matching values begins to increase. This is a good sign.
To be continued 2…
In the next post we’ll discuss step 4, at least: implementing the tool for bulk incremental data sync.