Persisting either/or values, or The uselessness of CHECK constraints
In the previous post we discussed the idea of erasure. It is directly related to our next topic (and, hopefully, the last one in the either/or series): how to store values of an either/or type in a typical database.
It is well known that backups are very easy until you need to restore from them. Storing values is similar: the main question is how to read the stored values back. That is quite easy if the values are simple (for example, if they have one of the pure, scalar types defined in “Structural and pure attributes”). But we chose either/or for this journey precisely because it sits in a sweet spot, complexity-wise, so the question of reading vs. writing is non-trivial.
Update: this turned out to be the longest issue so far, more than 2000 words.
Logical and physical representations
Continuing with the same health declaration example, we’ll return to a simple either/or data structure that we’ve discussed in the previous post (written in pseudo-code here):
HasSymptoms (symptoms: array) | HasNoSymptoms (pcr_test_result: boolean)
Two possible values of this high-level representation are, for example:
HasSymptoms (symptoms: ["cold", "coughing"])
HasNoSymptoms (pcr_test_result: false)
Those values are always correct, because there is no way to construct an incorrect one. For example, HasNoSymptoms (symptoms: ["cold", "coughing"]) would not be accepted by the parser: once you begin with the HasNoSymptoms branch, the only allowed inner field is pcr_test_result.
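To make "no way to construct an incorrect value" concrete, here is a minimal Python sketch of the same either/or type. The class and field names mirror the pseudo-code above; the dataclass encoding itself is our own illustration, not something from the original series.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class HasSymptoms:
    symptoms: List[str]

@dataclass(frozen=True)
class HasNoSymptoms:
    pcr_test_result: bool

# The either/or type is the union of the two branches.
Declaration = Union[HasSymptoms, HasNoSymptoms]

ok = HasSymptoms(symptoms=["cold", "coughing"])

# Mixing the branches fails at construction time: the HasNoSymptoms
# branch simply has no "symptoms" field, so this raises TypeError.
# HasNoSymptoms(symptoms=["cold", "coughing"])
```

At this level there is no way to smuggle a `symptoms` list into the `HasNoSymptoms` branch; the problem discussed below is that the database tables do not preserve this guarantee.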
For the physical representation we’ve chosen a typical two-table organization of those values:
CREATE TABLE arrivals (
    id INTEGER NOT NULL PRIMARY KEY,
    --
    -- ... other fields go here ...
    --
    has_symptoms INTEGER NULL,
    pcr_test_result INTEGER NULL
);

CREATE TABLE arrival_symptoms (
    arrival_id INTEGER NOT NULL,
    symptom_id INTEGER NOT NULL
);
We already know that there are several alternative approaches, but for this discussion this one will do.
Ideally, the physical representation is supposed to match the logical one in terms of “correctness by design”, but unfortunately our typical database system knows nothing about the logical representation; it only knows about the tables. Thus erasure, as discussed in the previous post, manifests itself.
We write to the database according to the logical representation, but that representation exists only in our heads. For example, to store the HasSymptoms (symptoms: ["cold", "coughing"]) value, we insert one row into:
Table: arrivals

| id     | has_symptoms |
| ------ | ------------ |
| 123456 | 1            |
And two rows in:
Table: arrival_symptoms

| arrival_id | symptom_id |
| ---------- | ---------- |
| 123456     | 1          |
| 123456     | 2          |
If we read the data from those rows back, we can reconstruct the original value unambiguously. But our database would not prevent us from, say, updating the row for id=123456 and setting has_symptoms=0. How would we then interpret the data? Should we ignore the dangling rows in arrival_symptoms, or should we assume that they are “more important” and that it is has_symptoms that holds the incorrect value?
If we encountered rows like this in a real database, we could make convincing arguments for both interpretations. We could analyze the code to find the logic behind the bug that affected the data; we could estimate the amount of incorrect data and maybe just ignore it or clean it up; or we could take the conservative approach and treat all incorrect data as “re-test all passengers affected by this bug”. But of course there are scenarios where such discrepancies cannot be handled so easily. So it is better to reduce the chance of getting incorrect data into our tables in the first place.
The CHECK constraint is useless, really
We really need to get this one behind us as soon as possible. There seems to be a commonly held belief that the CHECK constraint is somehow useful for anything at all. It is not: it falls short even for simpler data types, and it is hopeless for either/or.
Just to get this off our shoulders, let’s list the observations we can make:
The CHECK constraint is defined only for “traditional” relational databases with roots in ANSI SQL; many contemporary databases don’t even have a way to express it;
for a long time the CHECK definition was parsed but silently ignored by MySQL, one of the most widely used traditional relational databases; it only started being enforced in version 8.0.16;
CHECK only sees the values of a single row of a single table. So even if we define a CHECK constraint on arrivals.has_symptoms, there is no obvious way to count the corresponding rows in arrival_symptoms;
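The single-row limitation is not just a convention; engines actively refuse cross-table checks. SQLite, used here as a convenient stand-in (error wording differs per engine), rejects a CHECK containing a subquery at table-creation time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arrival_symptoms (arrival_id INTEGER, symptom_id INTEGER)")

# Try to express "if has_symptoms = 1, matching symptom rows must exist"
# as a CHECK. SQLite prohibits subqueries in CHECK constraints, so the
# CREATE TABLE itself fails with an OperationalError.
blocked = None
try:
    conn.execute("""
        CREATE TABLE arrivals (
            id INTEGER PRIMARY KEY,
            has_symptoms INTEGER,
            CHECK (has_symptoms = 0 OR EXISTS
                   (SELECT 1 FROM arrival_symptoms WHERE arrival_id = id))
        )
    """)
except sqlite3.OperationalError as e:
    blocked = str(e)
print("rejected:", blocked)
```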
In principle, you can call something like a stored function that receives the values of the current row and then queries other tables. Postgres seems to allow this, but its documentation contains some suspicious wording:
PostgreSQL assumes that CHECK constraints' conditions are immutable, that is, they will always give the same result for the same input row. This assumption is what justifies examining CHECK constraints only when rows are inserted or updated, and not at other times. (The warning above about not referencing other table data is really a special case of this restriction.)
Ugh, I’m not sure how to interpret this. I read it like this: PostgreSQL reserves the right to memoize the constraint check for the given row values and not to re-evaluate it, even if the other tables are updated in a different transaction. So you could get a classic time-of-check-to-time-of-use race condition that leads to incorrect data.
Finally, CHECK constraints only fire when rows are inserted or updated, not when they are deleted. So it’s possible to delete all the rows from arrival_symptoms and, again, corrupt the data.
Overall, our approach in this substack is to keep the most non-trivial case in mind. For example, our two tables could live in different databases, or even in different server implementations. We cannot ever hope to control the data that way, so we need to find some other approach.
Okay, the next thing that comes to mind is user-defined triggers. They are much better in the sense that they can validate all the data the way we expect, so all incorrect data would be prevented (or should we say: “only correct data would be allowed”?). However, there are still a number of issues to be aware of.
First, it’s obvious that our database server actually must support triggers.
Second, as we mentioned in the previous section, our tables could be in different databases (or database schemas), or even in different database server implementations.
Third, we would need to pay a lot of attention to deploying those triggers across our fleet, and to change management. Triggers are code, and code needs to be version-controlled: you must be able to deploy a change and to roll it back in case of problems. Typical database migration tools focus on changing the structure of tables; updating triggers is a very different task. Also, in a typical distributed database installation there will always be instances that are unavailable at the moment, due to networking or hardware failures, so we need to bring them up to date when they come back online. This is not easy, and a failed state can be tricky to detect and fix.
Finally, there is an interesting performance concern. Most of the time our writes are simply going to be correct, so we would always have that nagging feeling that we are wasting resources on unnecessary queries and computation. Of course, correct is more important than fast (we are dealing with primary data here), but maybe there is a way to have both?
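For completeness, here is what one such validating trigger could look like, again sketched in SQLite. The trigger name and the tuple of checks are our own; note that this guards only one corruption path (flipping the flag while symptom rows exist), and a real deployment would need companion triggers on arrival_symptoms as well, which illustrates the maintenance burden described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE arrivals (
        id INTEGER PRIMARY KEY, has_symptoms INTEGER, pcr_test_result INTEGER);
    CREATE TABLE arrival_symptoms (arrival_id INTEGER, symptom_id INTEGER);
    INSERT INTO arrivals VALUES (123456, 1, NULL);
    INSERT INTO arrival_symptoms VALUES (123456, 1);

    -- Abort any update that sets has_symptoms = 0 while symptom rows remain.
    CREATE TRIGGER guard_symptom_flag
    BEFORE UPDATE OF has_symptoms ON arrivals
    FOR EACH ROW
    BEGIN
        SELECT RAISE(ABORT, 'arrival still has symptom rows')
        WHERE NEW.has_symptoms = 0
          AND EXISTS (SELECT 1 FROM arrival_symptoms
                      WHERE arrival_id = NEW.id);
    END;
""")

# The corrupting update from the earlier section is now rejected.
blocked = None
try:
    conn.execute("UPDATE arrivals SET has_symptoms = 0 WHERE id = 123456")
except sqlite3.IntegrityError as e:
    blocked = str(e)
print("blocked:", blocked)
```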
Single way of truth
There is a principle of a single source of truth: every piece of data must be stored in a known location, and only once. If you want to know its value, you consult the source of truth.
To write structured data values such as either/or data, we can define a single way of truth: a single piece of code that writes out a value. It would be a carefully tested function that receives an argument of the either/or data type, validated by the compiler or by runtime assertions. It must update all the tables that participate in storing the possible values of that type, and it must be able to change any value into any other value (for example, from HasSymptoms (symptoms: ["cold", "coughing"]) to HasNoSymptoms (pcr_test_result: false)). (You will also need to implement adding and deleting values of this type.)
After that, do not allow any other way to change this data in your codebase. This, of course, requires some control over your code and your system. If you have legacy code paths that can arbitrarily change the data by updating tables directly (even if they try to preserve correctness), you will sooner or later have to deal with incorrect data; see the next section.
The code that implements this write needs to put all the fields and tables into a known state: the fields allowed within the branch are set to the provided data values, and all other fields are reset to an initial state (such as NULL).
For example, setting the value to HasSymptoms (symptoms: ["cold", "coughing"]) for id=123456 would require the following SQL statements:
BEGIN TRANSACTION;
UPDATE arrivals SET has_symptoms = 1 WHERE id = 123456;
UPDATE arrivals SET pcr_test_result = NULL WHERE id = 123456;
DELETE FROM arrival_symptoms WHERE arrival_id = 123456;
INSERT INTO arrival_symptoms (arrival_id, symptom_id) VALUES (123456, 1), (123456, 2);
COMMIT;
Resetting this value to HasNoSymptoms (pcr_test_result: false):
BEGIN TRANSACTION;
UPDATE arrivals SET has_symptoms = 0 WHERE id = 123456;
UPDATE arrivals SET pcr_test_result = 0 WHERE id = 123456;
DELETE FROM arrival_symptoms WHERE arrival_id = 123456;
COMMIT;
In this example we assume a transactional database and update the value inside a transaction. If we have more than one database, atomicity is not achievable, so we would just have to accept that our system does not provide this guarantee and design the rest of the code accordingly.
Speaking of transactions: it’s important that all updates touch the tables in the same order. In our examples we first update the “arrivals” table, then the “arrival_symptoms” table. This discipline prevents a lot of deadlocks and, fortunately, is easy to implement.
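The SQL statements above can be wrapped into the single writer function itself. This is a sketch, assuming a Python codebase and a tuple encoding of the branches ("has_symptoms" with a list of symptom ids, or "no_symptoms" with a PCR test result); the function and variable names are ours:

```python
import sqlite3

def write_declaration(conn, arrival_id, value):
    """The single way of truth: the only code path allowed to write this value.

    Resets every field the type can touch, and always updates the tables
    in the same order (arrivals first, then arrival_symptoms).
    """
    kind, payload = value
    with conn:  # one transaction: commits on success, rolls back on error
        if kind == "has_symptoms":
            conn.execute(
                "UPDATE arrivals SET has_symptoms = 1, pcr_test_result = NULL"
                " WHERE id = ?", (arrival_id,))
            conn.execute("DELETE FROM arrival_symptoms WHERE arrival_id = ?",
                         (arrival_id,))
            conn.executemany(
                "INSERT INTO arrival_symptoms (arrival_id, symptom_id)"
                " VALUES (?, ?)", [(arrival_id, s) for s in payload])
        elif kind == "no_symptoms":
            conn.execute(
                "UPDATE arrivals SET has_symptoms = 0, pcr_test_result = ?"
                " WHERE id = ?", (int(payload), arrival_id))
            conn.execute("DELETE FROM arrival_symptoms WHERE arrival_id = ?",
                         (arrival_id,))
        else:
            raise ValueError(f"unknown branch: {kind!r}")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE arrivals (
        id INTEGER PRIMARY KEY, has_symptoms INTEGER, pcr_test_result INTEGER);
    CREATE TABLE arrival_symptoms (arrival_id INTEGER, symptom_id INTEGER);
    INSERT INTO arrivals (id) VALUES (123456);
""")
write_declaration(conn, 123456, ("has_symptoms", [1, 2]))
write_declaration(conn, 123456, ("no_symptoms", False))
```

Because every write goes through the same reset-everything path, switching between branches cannot leave dangling symptom rows behind.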
Handling incorrect data
It makes sense to always be prepared to handle incorrect data. You will have it either because of legacy code that updates the tables directly, or because bugs will eventually slip a number of incorrect values into your database anyway. Most probably, the bugs will manifest when the specification of the either/or type is extended or reduced and the code has to change. No matter how much we test the code, the question is not if, but when.
So, we ran a manual query to investigate a complaint from a user, and we see rows that do not make sense together. From a practical point of view, we can:
treat the data conservatively, for example assume that all incorrect data represents the “more dangerous” case;
remove the incorrect values outright, if they are not too important; or re-request human processing of this data, if there is a way to reprocess it and it makes business sense;
if the data was broken by a code change, try to find the logic behind the bug, and maybe recover the incorrect values into correct ones;
if we have some kind of incoming data stream, reprocess it, provided it still contains the original data change requests;
extend the specification of our data values to incorporate the incorrect combinations, and then deal with those combinations in the code, for example:
HasSymptoms (symptoms: array) | HasNoSymptoms (pcr_test_result: boolean) | IncorrectHasNoSymptoms (symptoms: array)
This lets us work with that kind of data in a more structured way, hopefully preventing further bugs. Just don’t forget to remove the temporary branch once the data issue is fixed.
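A reader for the extended specification could be sketched like this; the function name and the tuple encoding of branches are our own illustration, assuming the row data has already been fetched from the two tables:

```python
def parse_declaration(has_symptoms, pcr_test_result, symptom_ids):
    """Map raw row data back to a branch of the extended either/or type."""
    if has_symptoms == 1:
        return ("HasSymptoms", symptom_ids)
    if not symptom_ids:
        return ("HasNoSymptoms", bool(pcr_test_result))
    # The flag claims "no symptoms", yet symptom rows exist: surface this
    # as the explicit temporary branch instead of silently guessing.
    return ("IncorrectHasNoSymptoms", symptom_ids)
```

Every downstream consumer is then forced by the type to decide what to do with `IncorrectHasNoSymptoms`, rather than tripping over the inconsistency at random points.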
Okay, this issue is the longest so far. I knew that this topic was not going to let us go so easily. I really want to talk about something else next time, maybe about game theory in database modeling.