Links and composite keys
First, I’d like to thank everyone who left their comments on the book draft. I’ve read them all, and now I’m going to add more text to address feedback. Also, I need to write more chapters on modeling some use cases. I consider the first phase of beta-reading very successful. Thank you!
Let’s continue the investigation of two scenarios from the previous posts:
(As an experiment, there is now a “Summary” chapter at the end of this post.)
We already mentioned one possible table structure for the “gifts” scenario (without indexes for now):
CREATE TABLE gifts (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
from_user_id INTEGER NOT NULL,
to_user_id INTEGER NOT NULL,
item_id INTEGER NOT NULL
);
We can write down similar table structure for the “performances” example (without indexes for now, see below):
CREATE TABLE performances (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
concert_id INTEGER NOT NULL,
musician_id INTEGER NOT NULL,
instrument_id INTEGER NOT NULL
);
Composite keys in the performances
table
We know that a user can send a gift to another user multiple times, that’s why we had to create the Gift anchor. But in the other case we can see that one musician can play a certain instrument only once in the concert lineup. In other words, the three values (concert_id
, musician_id
, instrument_id
) are unique. We could even say that they must be unique.
This immediately suggests that we should add a unique constraint on those three fields:
CREATE TABLE performances (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
concert_id INTEGER NOT NULL,
musician_id INTEGER NOT NULL,
instrument_id INTEGER NOT NULL,
UNIQUE KEY (concert_id, musician_id, instrument_id)
);
Moreover, if we’re inclined towards relational modeling purity, we can say that there is no need for the id
primary key column, because it’s just a meaningless number. The three real fields could well be a composite primary key:
CREATE TABLE performances (
-- id column is now gone
concert_id INTEGER NOT NULL,
musician_id INTEGER NOT NULL,
instrument_id INTEGER NOT NULL,
PRIMARY KEY (concert_id, musician_id, instrument_id)
);
This is a perfectly normal way of encoding the Performance anchor. It’s somewhat old-school maybe, but the logic behind this table is very solid, it’s hard to argue against it.
Splitting the tables further
One of the staples of Minimal Modeling is that each attribute and each link is independent. They only depend on their anchors, and anchors are all also independent.
Because of that, we always accept a possibility of minimal physical representation: a table per anchor (without attributes), table per attribute, table per link. This may or may not be a practical physical representation, but this representation helps us to overcome the bias of traditional physical representations. Let’s try and break down both our examples as much as possible.
The “gifts” example can be split into four tables: one table for the anchor and three identically-structured tables for each link:
CREATE TABLE gifts (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT
-- nothing else here, except for timestamps
);
CREATE TABLE gifts_from_user (
gift_id INTEGER NOT NULL,
from_user_id INTEGER NOT NULL,
PRIMARY KEY (gift_id, from_user_id),
INDEX (from_user_id)
);
CREATE TABLE gifts_to_user (
gift_id INTEGER NOT NULL,
to_user_id INTEGER NOT NULL,
PRIMARY KEY (gift_id, to_user_id),
INDEX (to_user_id)
);
CREATE TABLE gifts_item (
gift_id INTEGER NOT NULL,
item_id INTEGER NOT NULL,
PRIMARY KEY (gift_id, item_id),
INDEX (item_id)
);
Of course, this representation adds some overhead because you have to store four copies of primary keys, but that’s fine.
Now when we try to split both versions of the performances
table, we realize that we cannot do that easily. The three columns, concert_id
, musician_id
, and instrument_id
are entangled by the uniqueness constraint, either in the form of primary key, or just in the form of unique index.
What particularly troubles me here is that I’m not completely sure about the status of that uniqueness constraint. Both examples, “performances” and “gifts”, have very similar minimal models: one anchor with three links. But uniqueness constraint seems to greatly influence the table design. Should it be so?
Musician can be recorded as playing an instrument only once during a given concert. That is, the three-element tuple would have only unique values in our database. Then we make a leap and require uniqueness: we say that three-element tuples MUST have only unique values. It’s perfectly clear where it comes from, I’m not arguing that: we need every help we can get to make sure that our data is clean. But this sleight-of-hand change of modal verbs is concerning.
Note that if we were dealing with the dataset with duplicates, we could trivially de-duplicate it with SELECT DISTINCT.
Next, when we introduced the composite primary key, removing the auto-increment performance_id
column, we actually got two different things for the price of one: primary key gives us both a unique index, and it subtly introduces an identity, because that’s what primary keys are: they are supposed to be an identity of a table row.
But in this case, is it really? Let’s look at this problem from the other side. What if we were presented with the second version of the performances table, the one with the composite primary key, and asked to write down its minimal model?
We would then have two options:
Use the 1+3 model, introduced in the previous posts. Declare that the ID of the Performance anchor is a three-element tuple of three table columns (because we don’t have anything else to use as an ID)
Ask again, wouldn’t it be a 3-way link that is supposedly missing from the Minimal Modeling?
Let’s go back to the possibility of further splitting the first version of the performances table, the one with an auto-increment primary key. We can split it into five tables:
one table for the anchor (it keeps only the auto-incremented primary key ID);
three tables, one for each column;
one table that has the same three columns declared as a composite primary key, but used only to maintain the uniqueness constraint.
To insert into such a database, we would execute five INSERT
statements in the transaction. If we made a mistake and tried to insert a duplicate record, the insert to the last table would fail, and so the entire operation would fail.
So, there is actually a way to split the data that is tangled by the uniqueness constraint.
Summary
Why do we care? Here are some ideas that influence our thinking about those two use cases:
Splitting physical tables as much as possible is a useful mental exercise.
Thinking in terms of primary keys may unnecessarily influence your understanding of the model.
Primary keys are two entangled things: uniqueness constraint and identity. They can be untangled.
Physical schema could have decoupled writing and reading (the DISTINCT example and the “five tables” example).
CQRS (Command-Query Responsibility Segregation) approach could be seen even in such small example as ours.
You can get unintended requirements if you are not careful with your modal verbs. “Are unique” and “must be unique” are two different things: one is for the reader, another is for the writer.
Transactional atomicity together with unique constraints can be used to implement interesting data validity constraints.
Unique index can be considered a special side table that keeps a copy of inserted data and maintains the uniqueness constraint. This side table can be made explicit.
Some of those bullet points would be a topic for future substack posts.