Sentinel-free schemas: a thought experiment

Dec 20, 2023

In this post, we’ll do a thought experiment: try to systematically design a database schema that does not use NULLs. But we’ll also go even further: we’ll avoid sentinel values altogether. Let’s go along with the idea and see where it would lead.

Historically, the question of NULLs was quite contentious. There are three main problems:

NULLs (“absence of value”) are somehow not very compatible with relational model;
NULLs introduce additional complexity to all conditional operators: if you calculate “A * B” where “*” is any operator, and A or B is NULL then the result would sometimes be “UNKNOWN”, and not true/false;
It’s not clear what human-readable semantics of NULL is (like, is it unknown, not provided, not applicable, expunged, legally refused to provide, etc., etc.)

Here is a typical blog-post written by somebody with strong opinions about NULLs and the relational model: https://www.oreilly.com/library/view/sql-and-relational/9781449319724/ch04s04.html. For example:

“What’s more, you have no way of knowing, of course, just which queries you’re getting wrong answers to and which not; all results become suspect. You can never trust the answers you get from a database with nulls. In my opinion, this state of affairs is a complete showstopper. […]

[…] The net of all this is that if nulls are present, then we’re certainly not talking about the relational model (I don’t know what we are talking about, but it’s not the relational model); the entire edifice crumbles, and all bets are off.”

Woah, woah, chill! That ship has sailed a long time ago and I’d be very surprised to find any real-world database that is NULL-free. You have to maintain ironclad grip over the database schema to make sure that people don’t sneak in a nullable column somewhere for convenience, ain’t nobody got time for that.

Additionally, there are actually two ways for a NULL to appear in your query result:

It can come from a NULL value in a nullable column;
It can come from LEFT JOIN, e.g. on always-false join condition (1=0);

So we have a storage-NULL and result-NULL. With careful database design we can get rid of storage-NULLs, but we can’t systematically avoid result-NULLs given that anybody can write any query.

Sentinel values

But even if you would decide to go on a quest to remove NULLs from your database, something else would remain: sentinel values. Sentinel values are “out of range” values, for example:

Empty string if the user’s bio was not filled in;
“0000-00-00” if user’s day of birth is not known;
0 as the number of pages, if the book has no pages because it’s an audiobook;
String “UNKNOWN” if the country of residence was not provided by the customer;
And so on, and so on.

Sentinel values are also extremely common in real-world databases. If you look closely, you’ll see that actually the sentinel values are very similar to NULL values. You cannot decide which of the two users is older if one of them has “0000-00-00” as a day of birth.

But in this case, the NULL conditional logic won’t kick in, and if you carelessly compare a value with a sentinel value, you will get a definite, but possibly incorrect result. All the users would be younger than the user with “0000-00-00” day of birth.

Also, you would have the same problem with human-readable semantics. What does the string “UNKNOWN” mean? Did they refuse to provide their personal data, or skipped this question, or do they have a legal right to not disclose this info? Or maybe the data was imported from a third-party source that just doesn’t have this information.

Of course, NULL itself is also a sentinel value. In the real-world databases you can see all of that and then some: NULL, empty string, “N/A” string, etc., etc.

So, it’s not enough to have a NULL-free schema, we should also consider making the schema sentinel-free.

Sentinel-free schemas

We need to provide clear guidance for common scenarios, so that people wouldn’t be tempted to sneak in a fake value.

Per-attribute tables

The first inevitable conclusion that we must accept is that we cannot have two attributes of any thing in the same table, if those attributes are optional. For example, let’s look at a table for users with columns such as “day of birth” and “bio”:

Table: users id | day_of_birth | bio | 2 | 1983-08-14 | | 3 | NULL | Painter |

A user with ID=2 has specified their day of birth, but bio is empty. Another user, with ID=3 has non-empty bio, but no day of birth information, using NULL as a sentinel value.

The only way to handle this is to put each attribute in its own table.

Table: users_day_of_birth id | day_of_birth | 2 | 1983-08-14 |

and

Table: users_bio id | bio | 3 | Painter |

And so on, for each attribute. If there is a row in the per-attribute table it means that there is a useful value. If there is no useful value, there must be no row.

Every such table is independent from each other, and you can easily create new tables for new attributes.

Of course, the performance of such tables for typical queries may be different from what you expect from commonly-seen schemas, but we’re not concerned with that at the moment. We want to see if we can get conceptually clear.

ID tables

If all the attributes are extracted to separate tables, and all are optional, how can we know which users we even have in the database? Which IDs are valid?

The answer would inevitably be: we need a separate table for IDs. This table would look quite weird: it has a single column containing just IDs.

Table: user_ids id | 2 | 3 |

1:N relationship tables

Same story happens with table columns that refer to another table. Say, you have a table of patients, with assigned GP (general practitioner). Say a GP goes out of business, and all their patients become “orphaned”. In the traditional schema we would just set “patients.gp_id = NULL”, but we don’t want to see NULLs.

So we again must use a per-link table.

Table: gp_patients patient_id | gp_id | 115 | 37 | 124 | 40 |

etc. When a GP goes out of business, we remove the corresponding rows from this table.

Explicit reason for missing data

I think that most of the time it doesn’t really matter why exactly a piece of data is not there. If the user did not bother to provide a bio it’s fine.

But, there may be some information that we insist to have. For example, we must make sure that our users are at least 13 years old. It’s possible that initially we have not been gathering this data, but then a regulator forced us to. So, we require the users to provide their day of birth, and then use this value to decide if they could access our service. If the data is not provided, we don’t let them in until they do (but then of course we can still reject them due to being too young).

But what if the information that we ask is so important that users have a legally protected right not to provide it?

Suppose that we want to ask incoming travelers if they have symptoms of a certain illness, and we want to treat them differently based on the answer. At the same time, some of them may have e.g. a diplomatic status, and thus have a legal right not to answer.

Because this is an explicit legal situation, where each possibility needs explicit treatment, we need to store them explicitly. Say, we can use an enum with three values: “yes”/”no”/”declined_diplomatic_status”.

Later, if we find that there are other reasons to decline, we can add them to the list of enum values.

What’s important here is that we avoid the discussion of “what does an absence of value mean in real-world terms”, and just encode it explicitly. There would be a different branch of a business process for each case. People with symptoms go to quarantine, people without symptoms can proceed to testing, people with diplomatic status pass through special gates.

If it doesn’t matter why the value is not there, business-wise, then don’t bother with encoding that.

Comparing sentinel-free approach with NULLs

Let’s revisit the three problems that arise around NULLs that we listed in the beginning.

First, apparently sentinel-free encoding removes a major stumbling block on the path to theoretical correctness. This is always a good thing to hear. Even if we may not really be able to refactor any real database down to this state, we can at least create a virtual view of our database. Each attribute can be represented by a subquery that filters out NULLs and sentinels from the dataset.

Second, re: “UNKNOWN”. We removed one source of NULLs, but unfortunately the result-NULLs would still be there. If we decide to use per-attribute tables as discussed above, we’ll have to use LEFT JOIN in many queries. That operator will introduce NULLs, and we’re back to square 1: we must make sure that all our queries take into account the possibility of not-present values.

Another observation is that the underlying concern is uncovered. If we had a schema that forbids NULLs but allows sentinel values, we would still have to be careful with conditional operators, because some values are not like others.

Third, speaking of human-readable semantics of missing data. Here we have a big progress. We decided to first agree on the exact meaning of what the absence of value means, in business terms, and only then to proceed with encoding that data.

This approach can even help with normal schemas: if we want, we can explicitly define beforehand what NULL means.

To be continued

I wrote on a similar topic in February: https://minimalmodeling.substack.com/p/how-minimal-modeling-treats-nulls.

Some follow-up topics that I hope to cover one day:

Historized attributes: NULL-free and sentinel-free;
Positive confirmation of absence of data, and the problem of pseudo-redundancy;
On path-dependency in schema design;
JSON, nulls and NULLs;
NullMonad and SentinelMonad.

Minimal Modeling

Discussion about this post