How else to store either/or data
Main table, side-table, per-attribute table.
In the previous post we started discussing a physical representation of either/or data. Our first database design is simple and straightforward, but our idea is to explore the design space, so let’s discuss some alternatives. If you’ve just started reading this series, I’d suggest you skim the previous issue for the introduction (or even begin from the archives).
So, we have the arrivals table, and we’ve added the nullable field has_symptoms. We did it because the arrivals table maintains primary keys for the passenger arrivals, so adding a field directly to this table seems to be pretty obvious. But we have two more options:
a) create a separate side-table that will store data for this particular use-case (storing the health declaration data);
b) create a separate narrow table just to store this particular attribute.
There are some downsides to adding more and more fields to any particular table. If you do that long enough and your business logic is complicated enough, you may end up with a lot of those fields. Some real-world schemas may reach a few hundred columns in a single table. First, this is going to be quite hard to understand if you’re new to the code base and to the database, and there is no sufficient documentation. Many people expect to be able to read the table definition and get at least some sense of its purpose and the meaning of its columns. If you have 500 lines, just to read them all would require quite some time.
Second, in some databases adding a new column to the table may be an expensive operation, and it could require extra disk space.
Third, if your database uses row-based storage (as opposed to column-oriented store), the data on the disk is going to become less and less dense. This may decrease I/O efficiency for queries, including the queries that do not touch the new field at all. Basically, some queries may change the I/O patterns, and this can lead to noticeable performance degradation. Database servers go out of their way to mitigate this kind of problems for you, and there is a chance that yours would do a very good job to handle this, but this is still something that we must keep in mind.
By the way, we’re going to be talking a lot about pros and cons of different design decisions. Let us acknowledge from the very beginning that in practice the cons are very rarely the dealbreakers, and pros are very rarely decisive. They’re both just additional weights that can influence the engineering balance. The only actual deal breakers are usually various compliance requirements, everything else is manageable.
Anyway, another option we have is to create a table that would store the data directly related to health declarations. Let’s call it arrivals_covid, and let’s move the has_symptoms field to it:
CREATE TABLE arrivals_covid ( id INTEGER NOT NULL PRIMARY KEY, has_symptoms INTEGER NULL -- some other fields would go there );
If we would be working more on that use case, we’d expect to add more fields in this table, up to some reasonable number, maybe half a dozen or a bit more. The primary key of this table would reference the primary key of the main table, arrivals.id.
If our software system uses this approach to data modeling, we probably have a number of such side-tables, roughly corresponding to some area, topic or a project. Fields in those tables are more or less logically grouped: e.g., you would have fields related to payments, to the personal data, to the administrative stuff, etc. However, if you think about this, it’s not always clear where exactly each new field would go.
Also, almost always your use case would require some data that already exists in a different side-table (or in the main table). For example, for the purposes of COVID tracking we would most certainly need the information about the country of origin. It is definitely there, in one of the other tables, together with some other data. The question is really: are those side-tables a good way of grouping fields or are they just essentially random ad-hoc buckets? This is a fascinating topic, and we’re going to discuss it some time.
Narrow single-attribute table
What if we store every single attribute in its own table? That table would be narrow: one field for the primary key and one field for the data value (and maybe a couple of fields for timestamps and such). After this table is created, its structure would virtually never change after that: any other fields would go to their own tables.
So, for this attribute we would have:
CREATE TABLE arrivals_has_covid_symptoms ( id INTEGER NOT NULL PRIMARY KEY, has_symptoms INTEGER NOT NULL );
This kind of table would only have a row if the attribute was set. This allows us to make the value field (has_symptoms) non-nullable (compare with the side-table schema). The id field also refers back to the arrivals.id primary key.
This approach is interesting, with many strengths:
it’s very clear where to put each attribute: in a separate table;
the values are stored quite efficiently, particularly because only the set attributes are stored, and we don’t need to track NULL values;
the data on the disk is stored densely, so the disk access patterns are favourable;
you can create new attributes without affecting any of the previous data: the access patterns do not change;
And weaknesses too:
this approach is very unfamiliar to people; it’s an important concern because it encodes some kind of folk knowledge that we must make more explicit;
for every attribute value we store a value for the primary key. For small attributes, such as boolean, it means that instead of roughly 1 byte we would be using roughly 5 or 9 bytes (if our primary key is 32-bit or 64-bit);
writing SQL to query such kind of tables could be very tedious, especially manually;
execution plans for the queries need to be analyzed and confirmed to be acceptable;
(Both lists are far from complete, of course.)
Note that almost everything that we’ve discussed so far is not really about either/or data, it could be applied to any kind of attributes, pure or structural. But we’ve touched on many themes and ideas that need to be discussed separately. For now I’m going to wrap up this post, but let’s talk a little bit about naming.
Let’s trace how we’ve named our tables in fields in all three approaches. First we assumed that the table is called arrivals, and the field is called has_symptoms. Then we created the table called arrivals_covid. For the third table we’ve chosen the name arrivals_has_covid_symptoms. I intentionally did not pay a lot of attention to naming, just choosing whatever felt appropriate at the moment.
What we can see from those names is the interplay of local context and global context. We’re here now today, working on this COVID stuff. That’s why for us the name “has_symptoms” makes sense: of course we mean “symptoms of COVID-19”. But how would that name sound in a few years? I don’t want to suggest that there could be another event of a different pandemic with different symptoms, but it’s entirely possible, right? We would have to be more specific in naming the field then.
Also, note that when we lifted the attribute to the global level, the table name immediately became longer: “arrivals_has_covid_symptoms”. It would feel strange to have the table called “has_symptoms”, right? At the same time it feels perfectly normal to have a field called “has_symptoms”, because the context is provided by the name of the table, e.g. “arrivals_covid”.
We’ll discuss contextual naming in one of the future posts, too.
In the next installment we’re going to discuss extending and reducing the physical schema for either/or data, based on some insight garnered from the discussion in the “Concatenability of either/or data”.