Database Constraints, pt. I
In the first series of posts I’d like to talk about data integrity and structural correctness. We’re going to discuss some useful design principles, available tools, define some problematic cases and talk about how we could approach handling them.
Common case and extension vs General case and special-casing
There are two approaches to many design decisions: first, you can implement the common case and then design the extension to the common case. Examples of this approach are:
In many programming languages, ordinary functions are the common case and async-functions are the special case. There is a whole “What color is your function?” article that explains the distinction between asynchronous and synchronous code using the colours metaphor;
Many introductory texts about functional programming introduce the notion of a “pure” function which does not have side effects. You cannot read and write files or send HTTP requests using pure functions, so we have to introduce something like IO Monad that allows you to write the normal “imperative” code, just like you do in ordinary programming languages. In this framework, pure functions almost always feel like the common case, while the monadic functions feel like the extension.
React.js has the setState() API function that allows you to set the local state of the component. However, when you need to use the same piece of state from different components, you must find the common ancestor of those components, keep the state there and pass it down. This is the common case, and most of the teaching texts on React.js start with explaining this. However, as an extension you can also use the global state manager such as Redux.js. Introduction of packages like that often feels like a barrier, something that has a considerable price that you should pay only if you cannot avoid it.
Common case is fast and simple. Extensions to the common case are slower and more complicated, and often require special care. There is a bias against using extensions, and sometimes, as a result, it gets less attention design-wise.
A second, contrasting approach would be: design for the general case, and then implement fast-path special-casing for commonly-encountered situations. Examples of this approach are:
Your computer’s physical memory is the general case. CPU cache hierarchy handles commonly-encountered situations when you need to access the neighbouring regions of memory, making this access much faster.
Almost all compilers implement integer division as the corresponding CPU instruction. However, a very common optimization is: if you need to divide by 4 (or other power of 2) and the compiler knows that during the compilation, it can use the “shift right” CPU instruction, which is much faster.
You can run almost any SQL query on your database server. Some of those queries are going to take a lot of time if you have a lot of data. However, for many of your queries the optimizer can use indexes, and sometimes it can even get the data straight from those indexes, without touching the table data file. In this case, the disk and memory cache are used more efficiently, and the query runs faster.
Starting from the general case sometimes allows you to better focus on improving the subsets of the problem. Understanding the special case is often easier if you have the general case as the baseline.
Database Constraints are designed as the common case
Almost all database servers implement the same list of database constraints:
Check constraints (values to be inserted or updated must be within some bounds, e.g. price > 0, or percentage <= 100);
Non-null constraints (database column cannot contain NULLs);
Unique constraints (you cannot insert duplicate values into this column);
Foreign keys (if you have an ID in this column, it must correspond to one of the IDs in another table);
Check constraints are very easy to understand and to implement in the server code: you have the row values at hand, you execute some function and if it returns false, you reject the data.
Non-null constraints are basically the same as check constraints, but NULL values have an additional meaning in the database design. Dealing with NULL values (and the reasons why we need NULL values) is a very interesting topic that we’re going to talk about yet. Checking for NULL value is also extremely easy.
Unique constraints are interesting. To implement them, you need to look at the other rows of the table. More than that, you need to look at all the other rows of the table, and there could be a lot of those. It’s hard to implement the uniqueness constraint without the assistance of the database server.
If you try to implement it manually, you have to be careful not to run into the race condition: suppose you want to insert the value 3. You first query the database to see if there is a 3 already. If it is not there, you insert that value. However, if there are two simultaneous processes trying to do that, it’s possible that they both will check for existence of the value, see that it does not exist and then both proceed with inserting, thus creating duplicate value: a classic race condition.
Database servers solve this problem by making sure that only one process at a time does check and insert. Unique constraints is an extremely useful primitive building block for more complicated validation schemes (we’re going to talk about those later). Unique constraints are also closely connected to the idea of a primary key.
Foreign keys go even further, reaching for the data from other tables. You have to pay a price for that, because you can no longer execute the data change operation “locally”. Also, this constraint does not work in the general case of multiple databases: usually database servers only support foreign keys between tables in the same database. Of course, it does not work if you use more than one database server, for example MySQL and Cassandra.
As we said in the chapter title, database constraints are really implemented as the common case, and there are many general cases that we cannot solve directly, we have to find other ways.
More on structural validations
Let us look at two more examples of data structure validation that are non-trivial to directly solve with the standard database constraints.
First example comes from Github. Github has this absolutely brilliant feature that the IDs of both issues and pull requests are globally unique: if you use the #123 notation anywhere on Github, it will automatically understand if it’s Issue 123 or Pull request 123, because there could not be both. This is really user-friendly. But how do we implement that in our database?
Suppose that we have the issues table and the pull_requests table. Both tables will probably have the id column, it’s going to be the primary key and so we have the uniqueness constraint for each table. But those uniqueness constraints are not going to look in the other table when checking for uniqueness. Hmmmm.
Second example comes from the realities of the present day. Suppose that we want to register the health declarations of incoming passengers. They must fill in the form that basically asks:
Do you have any symptoms of COVID-19 from the list below?
If no, tick here: [ ]
If yes, mark all that apply: [ ] cold symptoms; [ ] coughing; [ ] elevated temperature.
So, for each passenger we need to store that they said “yes” or “no”, and if yes, we need to store the list of symptoms. If they avoided submitting the form we must be able to detect it. Also, we must make sure that we prevent the incorrect case: both “no” and a list of symptoms, which would be ambiguous.
This post is getting quite long, so we’re going to discuss the possible solutions to both of those problems in one of the following posts.