Migrations: migrating attributes, pt. IV
Welcome to Part IV of the “Migrating attributes” series. In the first three parts we made a brief introduction, established a 10-step migration procedure (reproduced below for convenience), and discussed the first four steps. Today we’re going to discuss step 5.
1. Prepare the new physical storage (e.g. create a table column);
2. Implement a cross-checking tool that compares old and new storage;
3. Make application code that sets attribute values double-write to both old and new storage;
4. Implement a tool for bulk incremental sync of attribute values from old to new storage;
5. Make application code that reads attribute values read from both old and new storage, with the following substeps:
   a) compare the results and use values from the old storage;
   b) compare the results and use values from the new storage;
6. Make application code no longer read the old storage;
7. Prepare the old storage for stopping writing to it;
8. Make application code stop writing to the old storage [this is the point of no return];
9. Clean up: a) the cross-checking tool; b) the sync tool;
10. Get rid of the old storage.
Step 5. Read attribute values from both old and new storage
At this step, a) our code double-writes new and updated attribute values to both old and new storage, and b) we’ve successfully run the bulk sync tool, so both storages are supposed to be in sync. Yet we don’t know whether we have found all the places where the data could be changed, so we need to wait and observe a bit more.
At the same time we can already begin migrating the reading code from old to new storage, but carefully. Our next goal is to find all the places where the attribute value is being read and change them.
There are two main reasons to read an attribute value: for filtering and for displaying. Suppose that we used the `status` field of the `projects` table, and we want to migrate this field to a separate boolean attribute table. So, everywhere our SQL statements used `WHERE projects.status = 'active'`, we want to change them to `WHERE projects.id IN (SELECT id FROM projects__active)`.
Here we’re not sure whether the values of the `projects.status` field and the existence of rows in the `projects__active` table are strictly in sync. What we can do is execute both SQL statements and compare the results. Initially we always return the results from the old storage. If both results match, we just throw away the results from the new storage. If the results do not match, we emit some kind of warning that is visible in the monitoring system; this is evidence that somewhere in our code there is a place that was not migrated to double-writing (or at least was not migrated properly). We must then return to step 3 and find the overlooked place. When we find the reason for the discrepancy, we fix it, deploy again, re-run the bulk sync tool (see step 4), and repeat the observation cycle.
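Here is a minimal sketch of such a double-read in Python, assuming an sqlite3-style connection and the `projects` / `projects__active` schema from above; the function name, the logger, and the query constants are illustrative, not taken from any real codebase:

```python
import logging

logger = logging.getLogger("attribute_migration")

# The two supposedly equivalent queries from the example above.
OLD_QUERY = "SELECT id FROM projects WHERE status = 'active'"
NEW_QUERY = ("SELECT id FROM projects "
             "WHERE id IN (SELECT id FROM projects__active)")

def fetch_active_project_ids(conn):
    """Step 5a: read both storages, compare, return the OLD result."""
    old_ids = {row[0] for row in conn.execute(OLD_QUERY)}
    new_ids = {row[0] for row in conn.execute(NEW_QUERY)}

    if old_ids != new_ids:
        # Emit a warning that is visible in the monitoring system and
        # contains everything needed to reproduce both queries.
        logger.warning(
            "double-read mismatch: only_in_old=%s only_in_new=%s "
            "old_query=%r new_query=%r",
            sorted(old_ids - new_ids), sorted(new_ids - old_ids),
            OLD_QUERY, NEW_QUERY,
        )

    # The old storage is still the source of truth at this step.
    return old_ids
```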
If you see such warnings, you can run the cross-checking tool (see step 2) to confirm whether the discrepancy is real. If it doesn’t report any, you may have a bug in the code that reads from the new storage (or in the cross-checking tool itself). Make sure the warning includes enough information to reproduce both queries.
How many times the cycle has to be repeated depends on how easy it is to navigate and change your codebase. Naturally, the more pervasive an attribute is, the harder the migration will be. If the attribute is old (and the codebase is mature), you can expect to overlook a couple of places that update the attribute in an unusual way. Among the trickiest are all kinds of ad-hoc meta-programming that construct SQL queries indirectly.
Your safety net here is the warnings on reading, plus the diagnostics from the cross-checking tool. There will be one more chance to catch a missed field: at step 10. By that time, however, we would already be past the point of no return in terms of data loss, so it’s better not to let things get that far.
Performance aspects
By reading from both storages we achieve another important goal: we can test how the new storage behaves under real load. Of course, the performance of the relevant parts of the system is going to drop somewhat, because you’re doing two reads instead of one. But the migration process usually takes only a limited time, and this should be acceptable in most real-world systems.
If your system degrades under the double-read load, you need to rethink or fix the new storage. For complicated migrations it is recommended to have some sort of kill-switch that allows you to disable double-reading without a deployment.
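A minimal sketch of such a kill-switch, under the assumption that the flag lives in a hypothetical `settings` table (in practice it could equally be your feature-flag service). Because the flag is re-read on every call, it can be flipped with a single UPDATE, without a deployment:

```python
def double_read_enabled(conn):
    # Assumed settings table: key TEXT PRIMARY KEY, value TEXT.
    row = conn.execute(
        "SELECT value FROM settings WHERE key = 'projects_double_read'"
    ).fetchone()
    return row is None or row[0] == "on"   # default: enabled

def fetch_active_project_ids_guarded(conn):
    if not double_read_enabled(conn):
        # Kill-switch engaged: fall back to a single read from old storage.
        return {row[0] for row in conn.execute(OLD_QUERY)}
    return fetch_active_project_ids(conn)  # the double-read path from above
```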
Step 5b. Using the results from the new storage
When the system runs reliably, no longer emits warnings, and the cross-checking tool no longer reports errors, you can switch the reading code to use the results from the new storage.
At this point you can disable reading from the old storage, or continue double-reading if you’re more risk-averse. In principle, continuing to double-read for some more time widens the window during which errors can be detected (by users and by other components of the system). Basically, this protects you from bugs in the code that compares the two results.
Note that at this point you’re not removing the code, just flipping another feature switch. The code itself will be removed in the following step.
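Continuing the earlier sketch, steps 5a, 5b, and 6 can share one code path, with a hypothetical `read_mode` setting (another assumption for illustration) deciding which side is returned:

```python
def fetch_active_project_ids(conn, read_mode="old"):
    """read_mode: "old" (step 5a), "new" (step 5b), "new_only" (step 6)."""
    if read_mode == "new_only":
        return {row[0] for row in conn.execute(NEW_QUERY)}

    old_ids = {row[0] for row in conn.execute(OLD_QUERY)}
    new_ids = {row[0] for row in conn.execute(NEW_QUERY)}
    if old_ids != new_ids:
        logger.warning(
            "double-read mismatch: only_in_old=%s only_in_new=%s",
            sorted(old_ids - new_ids), sorted(new_ids - old_ids),
        )

    # Flipping the switch changes only which result is used, not the
    # comparison itself; the unused branch is deleted in a later step.
    return new_ids if read_mode == "new" else old_ids
```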
In the next issue we’re going to begin discussing various flexible-schema approaches to data modeling and the reasoning behind them. The remaining five steps of the “Migrations” series will be discussed after that.