Migrations: adding attributes
What is this substack about? Here is the list of best ideas from the first 25 issues.
In the previous post we started a long discussion of database migrations. We tackled the easiest case first: adding anchors (or entities, or nouns). Now we can discuss adding attributes. Again, we begin with discussing the logical schema, and then continue with one particular aspect of physical implementation.
Attribute is a piece of data about the anchor, such as:
What is the name of the user?
What is the status of the payment?
Is the post active?
How much does this item cost?
Most of the time attributes are represented physically by some kind of a column in a relational database, but there are other popular implementations, such as key/value storage, “schemaless” approaches, JSON, and even non-relational storage such as S3. For example, you can represent an attribute “avatar of the user” by an S3 object that contains a JPEG file. Attribute data types are discussed in the “Structural and pure attributes” issue.
To create a new attribute, we need to have an anchor. So, we have a structural dependency between attributes and anchors.
All attributes are structurally independent: you do not care about any other attribute if you want to create a new one. However, there are two other ways of how different attributes can depend on one another.
First, you can have a data dependency between attributes (and also between attributes and links). This topic was extensively discussed: see “Either/or/or” in the Table of Contents of this substack. In this issue we’re not going to talk about this, for clarity.
Second, you can have a physical dependency between two attributes.
Physical implementation
As we mentioned in the previous post, you can almost always create a new anchor without affecting the behaviour of the system in any way. But with attributes the story is a bit different.
When you add an attribute, you need to choose a table where this attribute is going to be stored as a column. This is particularly relevant in the traditional relational environments. (We’re not going to talk about various schemaless solutions this time though: they introduce their own interesting challenges, worthy of a separate post.)
You have three possibilities: 1) add a field to the main table where the anchor is stored; 2) add a field to a side table (creating this side-table, or reusing one of the existing ones); 3) create a two-field per-attribute table. The thinking process behind this is discussed in the “Game theory in database design” post, where we introduced the concept of The Game of Tables.
In the first two cases, when we add our attribute to a table that already has one or more attribute columns, a physical dependency between those attributes appears. Logically those attributes are independent, but something interesting happens here: the behaviour of the system may change after the attribute was added, even though it is not used by the system in any way.
In many common database server implementations, mainly the ones using so called row-based storage of table data, the data layout may affect the query performance. Suppose that we have a users table with one user.id column and one integer column that contains, say, year of birth. If user.id occupies 4 bytes, and year of birth also occupies 4 bytes then on a single disk page of 4096 bytes we can store roughly 512 rows. (The actual math is more complicated, but the general point is valid).
Now suppose that we add another field with the username of the user, containing up to 16 characters. Now a single row requires 4 + 4 + 16 = 24 bytes, and a single page of 4096 bytes can store only around 170 rows, three times less than previously. It means that to fetch the list of years of birth we need to read three times as much disk pages than previously, and we need three times as much memory to process this data.
In practice this rarely causes any substantial problems, because both operating system and database server go to extreme lengths to shield you from such concerns. But sometimes this may cause performance issues: there are many interesting stories about that all over the internet.
Anyway, the point here is that physical dependencies between attributes may cause a change of system behaviour, even though the new attribute is not yet used by the system. This property is somewhat undesirable, because it slightly violates the concatenability principle. We would prefer if adding new features would not affect the existing features.
Another interesting point is that if you continue this line of reasoning you can see that there is no such thing as required attribute (more on that later).
Further directions
Adding an attribute is relatively easy. It will become interesting when we’ll want to get rid of the attribute and the data that it contains. It would turn out that we should have done some preparations before adding the attribute to make removal easier. We’re going to discuss this a few posts later, and we’ll get back to this post, tying one of the knots.
In the next post we’re going to discuss adding links between anchors.