Migrations: getting rid of attributes

Aug 31, 2021

What is this substack about? Here are the highlights from the first 25 issues.

Adding something to a computer system is relatively easy. Removing it is often more difficult, and more interesting. I’d even say that to better understand a technology or a component, you need to think how it would be removed.

In the previous three posts we discussed what happens when we add three basic concepts of our modeling approach: anchors, attributes, and links. First, we need to discuss the concepts in the order of dependencies. Both attributes and links depend on anchors. Second, attributes are more interesting than links. So let’s talk about getting rid of attributes.

Attributes are independent from each other, so for the purposes of this discussion we’re going to discuss removal of a single attribute. In practice you may sometimes want to delete a number of related attributes, optimizing some of the processes discussed here. Also, we will be solving the complete engineering problem: after we’re done, both code and data needs to be completely cleared of the mentions of the attribute: only the version control history and backups will remain.

Because we work on a live system, we must strive to reduce the outage window in case we make a mistake. That means that we must be able to revert every step along the way, more or less quickly. The point of no return should be as close to the end as possible.

Brief plan of the attack:

Incrementally remove all mentions of attribute from the codebase;
Deploy updated application;
Remove data;

Cleaning up the codebase

We need to find and eliminate all places where a certain attribute is read and written to. What shall we remove first: reading or writing? Both variants are possible, and both make sense in different scenarios. Sometimes we want to first stop reading completely, deploy the intermediate version, make sure that it works, and then stop writing. Sometimes we can stop writing, and then work on removing the parts that refer to the deprecated attribute.

Finding each and every place where a certain data attribute is mentioned could be challenging. If your code is written exclusively in a statically-typed programming language then the compiler will be of great help. But in many codebases you will find a number of places where you need to find the relevant pieces of code manually: for example, if your code executes direct SQL statements, or if your codebase is written in a more dynamically-typed programming language.

Most often your attribute is represented as a column in one of the database tables, but it’s also possible that it’s implemented as a key-value or something similar. Anyway, if our static tools cannot reliably help us find 100% of cases then we’ll have to resort to textual searches, such as grep. This is where we would begin regretting some of the naming choices that were made when we (or somebody else) created the attribute.

Before GDPR went into force, some companies went through the frantic process of codebase refactoring, looking for PII-containing database attributes and making sure that those are handled correctly (for some definition of “correctly”). Personal names obviously contain PII (personally-identifiable information). Good luck searching for the “name” string in 40 million lines of code!

If we’d be tasked with designing attribute names to solve the problem of eventually finding all possible mentions of the attribute, the first thing that comes to mind (ignoring esthetic concerns) is that the name of the attribute must be globally unique, for example: “attr_User__name”. This name is composed of a) Hungarian notation for type of thing: “attr”; b) name of the anchor: “User”; and c) name of the attribute: “name”. Humans may have differing opinions on this naming choice, though.

What’s important about this naming scheme is that it is regular and it could be applied without additional consultation, if this convention is agreed upon. Of course we could just try to choose some globally unique names by inventing more and more interesting variations (at least at this point we should consider not to be using such loaded and ambiguous names as “name”). But such variations have a risk of introducing semantic difficulties, sometimes leading to quite long (more than four words), albeit human-readable names for some things. Also, this approach would require keeping a catalog of such names to prevent duplication, and keeping this particular kind of catalog is not easy.

Anyway, one way or another we’ve been searching across the codebase for our attribute and removing lines of code that are no longer relevant. After the code is changed, it goes through some kind of testing, which we won’t be talking about here. Next step is deployment, post-deployment monitoring, and possible revert in case of any problems with the change. This may happen several times while we work on removing the attribute: we really want to encourage incremental approach for all database modeling activities.

Removing data

The ultimate way to test that our code does not use the attribute is to remove it and see if anything breaks. But we need a way to revert this in case anything does break.

If we’re using the ordinary relational representation of an attribute as a column of some table, we can rename that column to a temporary name. For example, “name” column could be renamed to “TO_DROP_name”. Renaming a column is an atomic operation in most databases, so we could easily try this, and just as easily rename the column back if our monitoring system detects a spike of errors related to this attribute.

After some time has passed without errors, we can proceed with physically deleting the column. Not only does it reclaim space and prevents accumulation of technical debt, but it also again confirms that our tools and procedures work. Do not miss the opportunity to test them on less-important data, or maybe assign this task to somebody less experienced, to let them gain familiarity with the process and tools.

Next week we’re going to continue discussing removal. We’ll see what could be said about links and anchors in this context.

Minimal Modeling

Discussion about this post