Introduction to schema migrations
What is this substack about? Here is the list of best ideas from the first 25 issues.
Let us talk about schema migrations. To ensure that we cover the entire design space we need to carefully introduce our main abstractions. Let’s use the following three basic concepts that seem to cover almost everything that is stored in databases:
Anchors (also known as entities, or nouns): things like User, Project, Invoice, etc.
Attributes of anchors: things like “Name of the User”, “Description of the Project”, “Serial no of Invoice”;
Links between anchors: things like “Project is assigned to User”, “Invoice contains Items”;
Anchors, attributes and links together cover the logical schema. They could be implemented on a physical level in many different ways. The most straightforward way is tables and columns, but there are several alternative implementations (e.g. key/value storage, “schemaless” approaches, JSON, etc.), including non-relational storage such as S3 or filesystems.
Note that a single logical object can have two or even more physical implementations simultaneously. Such a situation most commonly arises during storage refactoring.
We will discuss physical level only as much as necessary: there is too much variability and technical details that distracts from the essence of logical schema and its migrations. But we’re going to talk a lot about the code that works with that data, and the software lifecycles that accompany database migrations.
Because we do not focus on tables and columns, we will mostly talk about the primary data. Primary data is defined as a source of truth: if we lose primary data, it is gone forever. Secondary data, on the contrary, could be reconstructed from primary data, and it’s usually introduced to increase performance. Dealing with secondary data is a topic for a separate series of posts, we’ll get back to that some time later.
Second axis of our table would be:
Adding new (anchor/attribute/link);
Changing physical storage;
Thus, we’re going to have nine (three by three) cells in our table.
This is a logical first topic to discuss in this series, because all anchors are independent from each other. Because our anchor is new (we’re just adding it), there are no attributes and links that depend on this anchor (yet). This basically makes the operation almost trivial. Now let’s see if we can manage to discuss this trivial case in less than a thousand words.
It’s important to keep in mind that we talk here about an anchor without any data attributes. Suppose that we begin with User. At this point there are no attributes such as User.name, User.date_of_birth, User.avatar, etc. The only special attribute that we’re going to discuss here is an ID, e.g. User.id.
Let’s now discuss the main concerns that anchors deal with. First, it’s naming. Each anchor is a noun, and we must agree on it beforehand. If we choose an incorrect name we’ll have all sorts of communication problems later on. If the name of the anchor is not well aligned with the business domain, you have to constantly keep in mind the mapping between the real world and the name, and that adds a bit of a cognitive tax. We’ll discuss the naming later.
Second, anchor IDs deal with the question of existence (and non-existence) of things. Basically, the question is: if there exists something in reality, how does it correspond with the ID in the table? This needs to be a topic for a separate post.
Third, we must decide on the physical implementation of IDs. Would it be an integer data type, or UUID? Which table would be the main table storing a list of known IDs?
Please note that anchor IDs are not equivalent to primary keys! There could be several side-tables, and only one of them would contain the list of known anchor IDs, others will be just the references to that list. At the same time, all side-tables will have a primary key.
Physical implementation of IDs is a fascinating topic, and alas, it too demands to be discussed in a separate post!
Anyway, in practice most anchors would be implemented as good old relational-looking tables with an ID column. For the User anchor this table would most certainly be called users, and it’s going to have a primary key integer auto-increment column called users.id. Let’s assume this implementation for now. I understand that having a single-column table is unusual and we just ache to start adding other columns that we would certainly need, but bear with me for a moment.
An interesting observation is that we could create this table for the new anchor at any time. In virtually all databases creating an empty table is basically one of the cheapest operations, and it could be done at any moment without any effect on server performance.
Also, creating a table for the new anchor cannot affect the existing system that runs on the database that we just changed. This is important because it hints at how we can begin treating the gap between the expected and actual database schemas. This gap needs to be constantly managed for any living system, and there are interesting and important failure modes and opportunities related to that.
Okay, the first chapter turned out to be challenging. On the one hand, there is not much to say: “TL;DR: create a table, duh”. On the other hand, there are at least five follow-up posts that are clearly visible here. And we’re past the target of 750 words, so let’s wrap up for now, and discuss adding attributes in the next post.