How Minimal Modeling treats NULLs
(Before we begin: I’ve started a separate blog for my non-Minimal Modeling stuff. First post: “Quit factor”, or How the industry lulls itself to sleep with “bus factor”.
Also, please read my friend Victor’s “What it takes to be a Ruby core member during the war in your country, and why I am still doing it”.)
NULLs are a fact of life, and Minimal Modeling needs to deal with them. Here is how Minimal Modeling treats some concerns related to NULLs (and some adjacent topics).
NULLs appear on the physical layer (in the table columns), while Minimal Modeling attributes form the logical layer.
Attribute values should be meaningful
First, a simple example. Suppose that we have a users
table:
CREATE TABLE users (
id INTEGER NOT NULL PRIMARY KEY,
-- . . . some other fields skipped . . .
bio TEXT
);
In Minimal Modeling terms, we have a User anchor and a User__bio attribute.
For the User anchor we would write the following documentation query: “SELECT id FROM users
”. This query returns all the IDs of users, and no other data. If we have a thousand rows in the result then we have a thousand users, and vice versa.
The User__bio attribute is described by the following question: “What is the user’s bio?”. Bio is going to be displayed on some sort of profile page. We could write the following documentation query:
SELECT id, bio
FROM users
Note that in the table schema the “bio
” column is nullable (it lacks the “NOT NULL
” modifier).
Of course, not every user is going to fill in their bio. For such users we’re going to have NULL in the “users.bio
” column.
Minimal Modeling strongly suggests that you avoid having NULL values of your attributes. If there is no meaningful value for the attribute then we just say that the attribute is not set. We really want to only set attributes to meaningful values.
So, we may want to change the documentation query for User__bio to:
SELECT id, bio
FROM users
WHERE bio IS NOT NULL
So, there may be less rows in this dataset, than in the anchor dataset. Out of thousand users, only a hundred users have something in the bio.
(NB: do not worry about the performance of that query at the moment. This is a documentation query, generally it’s not supposed to be run against the database as-is.)
Not only NULLs
If we think about how the typical ORM probably works, we will quickly realize that what actually happens during a typical user signup is that the “users.bio
” field is going to contain not a NULL but an empty string.
Above we said: “We really want to only set attributes to the meaningful value.”. And here is the key: we don’t want empty strings either.
Let’s rewrite the documentation query further:
SELECT id, bio
FROM users
WHERE bio IS NOT NULL AND bio <> ‘’
And this is where Minimal Modeling makes a pragmatic choice and does not bother you too much with this “meaningful values only” stuff. Sooner or later you will actually add all those nitpicky conditions, but it’s fine to have not completely clean data when you just begin the modeling process.
NULLs could be weird
After you see enough real-world database schemas, you realize that a typical physical layer is basically hopeless in terms of data quality. In addition to empty strings and proper nulls, you could have all kinds of weird stuff.
For strings, you could have “UNKNOWN” as a hard-coded string constant. Or “N/a” (together with “n/a” and “N/A”).
For integers you could have various so-called “sentinel values” that are supposed to mark the absence of data. For example, you could have 0 in the user's year of birth. Or -1. Or 9999.
For timestamps you of course have well-known values such as “1970-01-01 01:00:00” or something.
There are two interesting cases from my experience that I like to use as examples. First, if you go to AbeBooks.com and enter “123456789” as ISBN, you will find 850 results. Of course, there is no such ISBN (it’s supposed to have ten digits!). This is just what people enter when some tool requires them to enter something, but they don’t know what to enter or don’t bother about the completeness of the data.
Another example: I used to work on a system that sent a lot of faxes to the business partners. Somewhere in the backoffice tool the fax number was required, but not all partners had them. So people just entered something like “+177777777”, and there were a lot of such patterns of all kinds. At some point I did an experiment and wrote a function that detected such repetitive patterns and just skipped sending.
All such examples are actually NULLs in disguise.
And this is where we again revisit the “We really want to only set attributes to the meaningful value”, and we see that now it makes sense to document such cases. We can write the following query for the User__year attribute:
SELECT id, year
FROM users
WHERE year BETWEEN 1900 AND CURRENT_YEAR()
As for the fax numbers example, we probably can’t even solve this only in SQL. But because the catalog is primarily human-readable, we can just add some information in a readable form, e.g.:
“Here is the query, but:
a) use
is_valid_fax_number()
Java function to check if the number is valid andb) the cleaned-up dataset of valid fax numbers is maintained in
users_valid_faxes
secondary table, so query it instead.”
How to find the average age of user
Let’s see how this translates in business terms. Suppose that ask the question: “What is the average age of our users?”
It would be very natural to treat the documentation queries as independent elementary datasets. You could implement that as a view, materialized view or just as Common Table Expressions (CTE).
So, we can write something like:
WITH user__year AS ( -- attribute query
SELECT id, year
FROM users
WHERE year IS NOT NULL AND year BETWEEN 1900 AND CURRENT_YEAR()
)
-- business query
SELECT CURRENT_YEAR() - AVG(year), COUNT(*)
FROM user__year
And then we can find the total number of users, to find out how many have provided that data, with:
WITH user__ids AS ( -- attribute query
SELECT id
FROM users
)
— business query
SELECT COUNT(*) FROM user__ids
Here the queries in WITH statements are standard, they are provided in the documentation, and you can be sure that the data returned is as clean as it gets.
To be continued
We discussed how Minimal Modeling treats the absence of data. In the second part of this series we’re going to discuss the other side of NULL handling: what does the absence of data mean.