How to store either/or data in the database

Structural data types: Introduction to design space of physical representation.

We’ve been discussing the database modeling aspect of either/or data for quite some  time now.  The finish line is in sight now!

Now we’ll go from the logical representation to the physical representation.  The next few posts would mostly be useful to software engineers.  If you’re, say, a business analyst and you prefer to stay within the logical representation, the good news is that it’s not going to change any more.  You can even skip this post, unless you’re interested in the engineering side of the problem.

A rough outline of the next few posts:

  • [today’s post] Propose a plausible physical schema that more or less matches our requirements;

  • Investigate all the alternative implementation and discuss the consequences of all potential changes;

  • Summarize the design space;

  • Extract actionable database modeling advice.

General approach

First we need to settle down our goals and non-goals.  In the previous post we classified either/or data as a structural data type.  We’re going to be discussing all of them eventually, and we need to understand beforehand: what are we trying to achieve?  How do we choose the optimal physical representation for our case?

Let’s draft a list of requirements.  First, we’re focused on collecting primary data.  It means that the raw performance of different queries is generally going to be discussed separately.  Of course, most of the time (the proverbial 80%) the typical queries on our stored data are going to be perfectly fine.  If you don’t actively fight the database, it’s going to take care of that.  For the remaining cases you would need to find a specific, performance-optimized secondary representation.  It is outside of our scope for today; let’s just make sure that we don’t fight the database.

Second, our representation needs to be concatenable.  Either/or data, for example, could be represented as a number of branches and a number of value attributes in each branch.  We want to be sure that we could easily change the physical schema to accommodate changes in logical schema.

Third, we want our representation to be minimal.  Roughly speaking, we want to write to the disk as little bytes as possible, unless other concerns override this rule.

There is no way that there could be only three requirements in this list.  Hopefully we’ll think of something else later.

Setting the scene

Of course, we’re going to continue analyzing the same “health declaration example”.  Let’s revisit it again:

Suppose that we want to register the health declarations of incoming passengers.  They must fill in the form that basically asks:

  • Do you have any symptoms of COVID-19 from the list below?  

    • If no, tick here: [ ]

    • If yes, mark all that apply: [ ] cold symptoms; [ ] coughing; [ ] elevated temperature.

So, for each passenger we need to store that they said “yes” or “no”, and if yes, we need to store the list of symptoms.  If they avoided submitting the form we must be able to detect it.  Also, we must make sure that we prevent the incorrect case: both “no” and a list of symptoms, which would be ambiguous.

We do not design in vacuum, so there is probably already a database of landing passengers with all kinds of data about them: arriving flight, name, nationality, reason for visiting, customs declarations, etc.  So, there must be some kind of entity (aka noun, aka anchor) that corresponds to each fact of arrival.

Filling in the declaration is mandatory, but that does not mean that our attribute would always be there. When the pandemic started, we needed to add the health declaration as defined above.  But there are already many arrivals in our database that happened before the pandemic!  Also, there could be very real situations when the declaration was not submitted: for example, the landing passenger was physically unable to do so due to illness, or they did not cooperate with the law.  So, we must clearly distinguish between the attributes that were set to some value, and attributes that just were not set.  (Roughly speaking, this is what NULL would do for simpler attributes).

Tables schema (draft)

We can imagine that the main table for arrivals would probably be called arrivals.  It has a primary key, called arrivals.id.  There are probably many other fields in this table.

We already did a useful preliminary modeling exercise in the “Concatenability of either/or data” post. We discussed there how to represent this data in JSON-like structure, and we considered the following encoding:

  • {“has_no_symptoms”: {} } for “declared no symptoms” case;

  • {“has_symptoms”: {“symptoms”: [“cold”, “coughing”]}} for “declared symptoms” case;

But let’s just show the prototype tables (in pseudo-SQL):

CREATE TABLE arrivals (
  id INTEGER NOT NULL PRIMARY KEY,
  --
  -- ... other fields go here ...
  --
  has_symptoms INTEGER NULL
);

CREATE TABLE symptoms (
  id INTEGER NOT NULL PRIMARY KEY,
  symptom_name VARCHAR(50)
);

CREATE TABLE arrival_symptoms (
   arrival_id INTEGER NOT NULL FOREIGN KEY arrivals.id,
   symptom_id INTEGER NOT NULL FOREIGN KEY symptoms.id
);

Example data

In the beginning we need to add several rows to the symptoms table, such as:

| Table: symptoms  |
|id | symptom_name |
| 1 | cold         |
| 2 | coughing     |
| 3 | temperature  |

Those symptom IDs would be used later.

We have three main scenarios: I) the attribute has not been set; II) declaration was submitted: no symptoms; III) declaration was submitted: has coughing and temperature.   Let’s see the data that would be added to the tables in each scenario (the ID of our arrival would be 123456).

I. The attribute has not been set:

| Table: arrivals       |
| id     | has_symptoms |
| 123456 | NULL         |

The other table, arrival_symptoms, has no new rows.

II. Declaration was submitted: no symptoms

| Table: arrivals       |
| id     | has_symptoms |
| 123456 |            0 |

The other table, arrival_symptoms, again has no new rows.   We represent the boolean false value as zero here.

III. Declaration was submitted: has coughing and temperature

| Table: arrivals       |
| id     | has_symptoms |
| 123456 |            1 |

and

| Table: arrival_symptoms |
| arrival_id | symptom_id |
|     123456 |          2 |
|     123456 |          3 |

In the arrivals table we represent the boolean true value as “1”, which is the standard approach to encoding boolean values in common databases. “2” and “3” here are the numeric values for coughing and temperature, taken from the symptoms table, see above.

Now what?

The schema presented above is basically random, there is nothing special about it.  We did not spend a lot of time on designing it: we just wrote down almost the first thing that came to mind.  What we can do now is to try and see what would change if we make other physical modeling decisions.  Will the schema become easier?  More performant?  Will some changes make the schema harder to understand?  Will it be more extensible?

We can see about a dozen of changes that we can potentially make.  In the next post we’re going to make a sort of impact evaluation for all of those changes.  Then we’ll summarize what we’ve learned during this exercise, and try to outline some actionable database modeling advice (which is the ultimate goal of this substack).