Requirements change: migrating either/or data schema

Mar 09, 2021

Since the previous post our substack has reached the first milestone: 100 subscribers. Thank you for your interest!

In the previous post we discussed some possibilities for representing either/or data in physical database schemas. Now we can investigate how we can make our schema extensible in both directions: a) adding new attributes and new branches and b) removing attributes and branches that are no longer needed.

Nothing is supposed to be groundbreaking here: we really just explore the design space!

In the “Concatenability of either/or data” we’ve investigated how JSON-based schema could be changed to accommodate the typical use-cases, and we’ll use here the same sequence of changing requirements.

As a reminder, initially we needed to handle two cases:

either you declare that you have no symptoms;
or you declare that you have some symptoms, and in this case you provide the list of symptoms that you observe;

We store the “yes/no” decision in the arrivals.has_symptoms field (where 0 means “no”, 1 means “yes”). The list of symptoms is stored in a separate two-column table arrival_symptoms.

Adding new piece of data

Let’s ask the additional question in case of “yes I have some symptoms”: what is the result of your PCR test? This is supposedly a yes/no answer, so it could be represented by another integer field, for example called pcr_test_result, where 0 means “negative” and 1 means “positive”. Obviously, this data could be stored in its own field of integer type. For the “no symptoms” case this piece of data is not relevant, so we should make this new field nullable:

CREATE TABLE arrivals (
  id INTEGER NOT NULL PRIMARY KEY,
  -- many other fields here
  has_symptoms INTEGER NULL,
  pcr_test_result INTEGER NULL
);

It’s possible that the test result was not provided: for example, we already have some database records that were inserted before we had this new requirement. In that case of course we can also use NULL value for pcr_test_result, the same that we would use for the “no symptoms” case. This is a curious little circumstance, let’s make a mental note about it.

Another question: could we merge both data pieces into a single value? We can see three possible cases:

no symptoms;
has symptoms, PCR test negative;
has symptoms, PCR test positive.

We discussed dependent and independent pieces of data in one of the previous posts. Here we have the typical situation: 3 out of 4 possible combinations have a meaning. If needed, we could write down the simple quadrant to prove that those two pieces are dependent.

So, the question is: what if we redefine the has_symptoms field to have the following values:

0: no symptoms;
1: has symptoms, PCR test negative;
2: has symptoms, PCR test positive.

Certainly, this encoding is pretty efficient: we can use a single byte to store two pieces of data, not one.

But we already know that here we may have a situation where PCR test results are not provided (yet). We could extend the list of allowed values above, but note something curious: if we would be designing this new structure (including the “result of PCR test” field) from the very beginning, we had a much greater chance of implementing it as a single field. Here, when we change the existing schema, we see that it may really be easier to just add a new field and not bother with entanglement.

Let’s make another mental note about that. Maybe the order in which requirements arrive should not affect the resulting design? One of our guiding principles is the principle of concatenability, and it seems that it may be manifesting here.

Adding top-level branch

Following the “Concatenability of either/or data” post, we introduce the third top-level option: “Exempt from submitting health data”. Thus we have three cases:

no symptoms (implemented as has_symptoms=0);
has symptoms (implemented as has_symptoms=1);
exempt from submitting health data.

How can we change our implementation to accommodate the new case? First thing that comes to mind is, of course: has_symptoms=2, right?

First problem with that solution: the name of the field (has_symptoms) begins to lose its meaning. If we would be designing this new structure (including the “exempt from” branch) from scratch, we would most certainly choose a different name for it, right?

This is a typical problem in database design: the meaning shifts when requirements change, but the persistent identifiers cannot be easily changed. Thus the gap appears. Is there anything that we could do here? Let’s make another mental note.

Second problem: the values we have chosen (zero and one) are very convenient to represent boolean values, and that’s why we chose them. Zero means false, one (or even non-zero) means true. So somewhere in our codebase there are code fragments like “if (has_symptoms) { … }” or “if (!has_symptoms) { … }”, or maybe even “if (has_symptoms) { … } else { … }”. If we introduce another possible value, we’re going to break all three fragments.

We could adjust our interpretation of this field by treating it as integers. Then we could semi-mechanically refactor the code to use direct comparisons: “if (has_symptoms == 1) { … }”, etc. This is fine, though it feels a little bit too low-level. Could there be some higher-level treatment of this general problem? Let’s make another mental note.

In practice boolean data type often eventually becomes insufficient and requires the addition of one or more extra cases, like here. This change is tricky also because booleans are such a nice and natural data type for software development, so by the time this change is needed, your code is full of pretty little statements like above. I considered tackling this topic in the current issue, but I feel that it deserves its own chapter: treating booleans as a subset of either/or data.

Removing piece of data

Following the “Concatenability of either/or data” post, we now ask what would happen if we decide that we do not need to keep the list of symptoms, just the fact that there are symptoms.

In our current schema the list of symptoms is kept in a separate two-column table, arrival_symptoms. We could 1) stop writing to it, 2) stop accessing it completely, 3) wait a bit and then drop the table. The list of symptoms is the personal health data, so we really should think through the entire lifecycle of this data, up to its purging.

Our schema handles this case quite cleanly, but let’s revisit some potentially problematic decisions we could have made. First, it’s tempting to declare that the “has symptoms” question could have been reformulated by “are there any rows in the table of symptoms?”. That would have created an entanglement between the “has_symptoms” attribute and the “list of symptoms” array-typed attribute. It would work perfectly, but would require some refactoring during the removal of this data. Worse, we could decide to implement it by 1) introducing a fake “symptom” that exists only to support the existing queries, 2) adding this symptom to every arrival that has any other symptoms and 3) removing all real symptoms.

This is interesting. It looks like there are cases that look like data duplication, but maybe they’re not. We could confirm that by the usual trick: imagine what happens if requirements would have arrived in a different order. First we have a schema that only records the value of a “I have symptoms” checkbox; this requires an integer field. Then, a few weeks later, we decide to also gather the information about symptoms, and we create the two-column table storing this data. Would we then go and decide to kill the original integer field? Or would we allow our developers to treat both data sources as equivalent? This is really a non-trivial topic that we need to investigate.

It really seems that there is a little landmine under every tiniest decision. How do people even have courage to design databases, really?

One actionable thing that we got out of this discussion is the following principle: Final design should be the same no matter what was the order of implementing requirements.

For now, we have at least four new fractal directions for the future posts:

booleans vs either/or/or data;
naming and changing of requirements;
semantics of NULL values could be ambiguous;
more structured and less low-level handling of either/or data.

Also, we need to begin discussing how to change the complex values of either/or/or data type. Basically, which insert/delete/update statements do we need to execute to change from “I have no symptoms” to “I have symptoms: cold and temperature”? But before we can talk about that, let’s talk about erasure.

Minimal Modeling

Discussion about this post