Subtypes and status-dependent data: pure relational approach
In this post I’ll show how to use constant columns to solve a common problem of encoding status-dependent information (or subtypes, more generally).
Constant columns always contain the same hardcoded value in each row of the entire table, enforced by a check constraint. This is a genuinely interesting approach that I think is not common knowledge.
Every time I post some new content that talks about table schemas I get a handful of scathing remarks from people who suggest that I haven’t read enough of works by certain obscure authors that I supposedly never heard of, most commonly Codd and/or Date. It’s not always clear how my thinking is supposed to change, if I were to engage in that type of activity, but I am very much interested in how other people’s thinking changes as the result of reading theory, so I always try to understand the presented opinion, however unflattering and often opaque.
This time I received a genuinely interesting insight about encoding subtypes in a way compliant with the Make Invalid States Unrepresentable principle. “Subtype” is a bit abstract: specifically, we’ll apply it to modeling statuses and some data that is only valid for a specific status.
Problem statement
Consider the following problem: we need to build a system that lets customers upload image files, and processes those files, detecting various animals pictured in those images. An uploaded image could be in three possible states:
pending;
invalid: in this state we need to record an error message, such as “unrecognized format”;
completed: in this state we need to record a list of animals detected in the image;
We want to design a solution that removes the possibility of incorrect (contradictory) data. We are interested in purity, so let’s make the schema NULL-free. We’ll use classic relational constraints such as CHECK and foreign keys.
Solution via constant columns
To maximize the “TL;DR” factor, let’s jump straight to the solution. We need three tables:
main table that contains images;
second table that contains errors for invalid images;
third table that stores the list of detected animals;
(Here is the playground link with the schema and some queries: https://dbfiddle.uk/h7gRpBbI).
CREATE TABLE images (
id INTEGER NOT NULL,
status VARCHAR(16) NOT NULL,
image VARCHAR(4096) NOT NULL,
CHECK (status IN ('pending', 'invalid', 'completed')),
PRIMARY KEY (id),
UNIQUE KEY (id, status)
);
CREATE TABLE invalid_images (
id INTEGER NOT NULL,
status VARCHAR(16) NOT NULL DEFAULT 'invalid',
error_message VARCHAR(128) NOT NULL,
CHECK (status = 'invalid'),
PRIMARY KEY (id, status),
FOREIGN KEY (id, status) REFERENCES images (id, status)
);
CREATE TABLE detected_animals (
image_id INTEGER NOT NULL,
status VARCHAR(16) NOT NULL DEFAULT 'completed',
ndx INTEGER NOT NULL,
latin_name VARCHAR(64),
CHECK(status = 'completed'),
PRIMARY KEY (image_id, status, ndx),
FOREIGN KEY (image_id, status) REFERENCES images (id, status)
);
Annotated schema: images
Here is the annotated guide to this schema. It is written in an old-school way with the goal to make it maximally illustrative.
In the images table we have a synthetic integer “id” column. In a real database we would use an auto-incremented value.
The status column contains one of three string values: ‘pending’, ‘invalid’ and ‘completed’, enforced by a CHECK constraint. In a real database we would use a sort of enum or small integer, to save space.
We store the “image” as a string here, imagine that it’s base64-encoded or something, it doesn’t matter. In a real database we’d use some BLOB type, or store the image file in S3, or whatever.
We have “id” as a primary key, but we also create a unique constraint on two columns: (id, status), even though “id” is already unique. We need it because we need to establish some foreign key constraints.
Annotated schema: invalid_images
Now for the “invalid_images” table. The interesting part starts here. This table needs to contain only records about invalid images.
We have the normal “id” column that is the same as “images.id”.
We also add a “status” column that can contain only one value: ‘invalid’, and this is enforced by the CHECK constraint. We also set it as the default value for convenience.
We define a composite primary key made of both columns: (id, status). Also, we define a foreign key to the same pair of columns in the “images” table.
This is where the magic happens. We can now have the error information (“error_message” column) only for images in the invalid state. If the image status in the main table is different, you get a foreign key constraint violation.
Annotated schema: detected_animals
This example is purposefully constructed to include information of different cardinality. Otherwise we’d be tempted to collapse the schema into something more simple that does not illustrate the concept.
Here we have a composite primary key made of three columns: (image_id, status, ndx). Here “image_id” is the same as “images.id”.
Status is always ‘completed’, enforced by the CHECK constraint. For convenience this is also a default value.
The “ndx” column is probably auto-generated too. It does not matter what it contains, the values only need to be distinct among rows related to the same image. We could use 1, 2, 3, starting anew for each image, or to use a globally-incremented value.
Here we also define a foreign key constraint on (image_id, status) pair, referring to (id, status) pair in the “images” table. This means that we can insert rows into this table only for images in the ‘completed’ status.
Example queries
Let’s try running some queries to confirm that correctness is preserved. (Error messages vary from server to server, here PostgreSQL is used).
-- insert a new pending image
INSERT INTO images (id, status, image)
VALUES (1, 'pending', '0x1234567');
-- we cannot set error message of pending image
INSERT INTO invalid_images (id, error_message)
VALUES (1, 'Corrupted JPEG file');
-- ERROR: insert or update on table “invalid_images” violates foreign key constraint “invalid_images_id_status_fkey”
-- DETAIL: Key (id, status)=(1, invalid) is not present in table “images”.
-- we cannot record detected animals too
INSERT INTO detected_animals (image_id, status, ndx, latin_name)
VALUES (1, 'pending', 1, 'Canis lupus');
-- ERROR: new row for relation “detected_animals” violates check constraint “detected_animals_status_check”
-- DETAIL: Failing row contains (1, pending, 1, Canis lupus).
Now testing invalid image:
-- we set an error message for the invalid image
INSERT INTO images (id, status, image)
VALUES (2, 'invalid', '0xABCDEF');
-- note that status column is handled automatically
INSERT INTO invalid_images (id, error_message)
VALUES (2, 'Parse error');
-- now we cannot change its status back to pending or to completed
-- we need to delete a row from invalid_images first
UPDATE images SET status = 'completed' WHERE id = 2;
-- ERROR: update or delete on table “images” violates foreign key constraint “invalid_images_id_status_fkey” on table “invalid_images”
-- DETAIL: Key (id, status)=(2, invalid) is still referenced from table “invalid_images”.
Finally, completed image with some detected animals works:
INSERT INTO images (id, status, image)
VALUES (3, 'completed', '0xFFFFFFF');
-- note that status field is handled automatically
INSERT INTO detected_animals (image_id, ndx, latin_name)
VALUES (3, 1, 'Felis catus'), (3, 2, 'Ovis aries');
You can try for yourself to bring this system into some sort of contradictory state: https://dbfiddle.uk/h7gRpBbI.
Where can you learn about this?
There is a lot of discussion of modeling subtypes available on the internet and in some books. However I am not aware of this approach presented clearly in a proper teaching manner. Many books discuss subtypes but in the best case you have to understand this approach first to recognize it in the text. In many cases table-level design is not discussed at all, or presented with some handwaving, or the lax design is presented (with nullable attributes and potential dangling references).
If you know where one could have learned this as part of a standard curriculum please share specific references to books (with page numbers) or other materials.
I finally understood this topic just a few weeks ago after another shot at comprehending a StackOverflow reply from 2011 by the great PerformanceDBA: https://stackoverflow.com/a/4898314/7754. Specifically, the explanation in Subtype.pdf. (PerformanceDBA’s writing style is rather challenging for the unprepared reader.)
Now when I understand this, I can more easily find other examples of this technique. Here is a reply by Lennart: https://dba.stackexchange.com/a/273818 (written in 2020), where this approach is used without any elaboration. (This link could be found on the fourth page of Google search results). I picked up the idea of using “DEFAULT” from this reply.
In C. J. Date’s “Relational Database: Selected Writings” (1986) a similar problem is discussed in section “Entity supertypes and subtypes” on pages 457—460 (almost at the end of the book, in “A Practical Approach to Database Design” chapter). On page 459 only a lax design is discussed: EMP and SALESMAN tables, connected by the “EMP#” foreign key. Date goes on to say:
“A disadvantage of this representation is thus that it introduces some additional foreign key constraints that need to be maintained. In fact it also introduces another integrity constraint that (given the state of systems to-day) will also have to be enforced by application code, namely that a given employee number must appear in the
SALESMANtable if and only if that employee has job ‘Salesman’ in theEMPtable. Nevertheless, it is still the case that the representation is logically cleaner than that involving just a single table.”
I’m not even sure if the CHECK constraint existed back then (it was standardized in SQL-92, apparently). Anyway, this specific CHECK constraint could also be simulated by a foreign key to a single-row single-column table that contains a single string such as ‘invalid’.
In Hugh Darwen’s “An Introduction to Relational Database Theory” (2010) this is discussed on pages 230—232, again at the very end of the book. Unfortunately, a very contrived example (“the wives of Henry VIII”) is presented, and no specific table structure is displayed, so it’s hard to recognize if this technique is used.
I hope to find more references someday, but at the moment this does not seem to be common knowledge.
Footnotes
Creating a constant column feels extremely weird for the database practitioner. At the same time, similar usage of constant values is ubiquitous in programming language theory. It’s possible that some people more easily cross this internal barrier to understanding, or they are lucky to learn it from somebody who did. When you do, this approach instantly feels very natural, and more things start falling into place.
Particularly, it feels awkward to store repeated copies of the same data. One can notice that the database server could theoretically optimize this away, making this column completely virtual, existing only in table metadata. However, as this approach is not commonly used in real-world schemas, it’s easy to see why it may not have been a priority for the developers.
A more lax design could be commonly seen on the internet. You can remove both constant columns (invalid_images.status and detected_animals.status), and use just the “image_id” column as the foreign key. This, unfortunately, allows some incorrect and contradictory states: for example, you can have a completed image with an error message, or you can store a list of detected animals for the invalid image. This is often considered not a big deal in practice, but the goal of this post is to learn what the theory allows.
Additionally, in the lax design we would just move the “error_message” column into the “images” table, make it nullable and set its value only for invalid images. We could add a CHECK constraint that enforces that part.
In a follow-up post we could discuss how this pure schema deals with extending, in three different directions:
additional options for the item status (e.g., how to record that the image processing crashed);
additional status-specific information, for example a time stamp when the image has been processed (for both invalid and completed images);
additional subtype; for example payment status of each image: invoiced, paid, refunded, etc. This is independent of the item processing status. Of course, you can also have extra payment status-dependent information.
One final note: the word “subtype” may suggest that we discuss the identity of an object, and this is very common to see on the internet and particularly in books. For example, we can think about selling different items, such as books and washing machines, with type-specific attributes (number of pages vs loading volume). As you can see from this post, statuses can also be represented by the same approach, so the word “subtype” may have been slightly misleading in the commonly available materials.
Conclusion
This post presents a clear description of a pure relational approach to modeling subtypes and dependent data. This approach is known, but not very well known, and not taught clearly, so I hope that this post may serve as a good reference.
The main contribution of this post is a carefully constructed example, consistent with typical real-world needs. This example cannot be simplified without losing completeness. At the same time, it illustrates a possibility of extending which is also necessary in real-world applications.
You may not be using this technique in the real world, but I believe that understanding this pattern really helps with understanding the classical theory behind database design.
