Path dependence in relational design

Jul 02, 2024

Here is a very simple table design problem: suppose that we want to build an issue tracker, where people can leave comments and also upvote those comments.

Path 1a: initial design

The design is very straightforward (we show only the fields we’re interested in, and a couple of text fields for some realism).

CREATE TABLE issues ( id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT, body TEXT NOT NULL );

CREATE TABLE comments ( id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT, body TEXT NOT NULL, issue_id INTEGER NOT NULL );

CREATE TABLE comment_upvotes ( comment_id INTEGER NOT NULL, user_id INTEGER NOT NULL, PRIMARY KEY (comment_id, user_id) );

Path 1b: stable comment numbers

Some time passes, and we decide to add an often-requested feature: stable numbering for comments. We want to refer to each comment by URLs such as /issue/12345#c1, /issue/12345#c2, etc. If the comment is removed, we want the numbering for the remaining comments to stay intact.

So, we need to add the “seq_num” field to the “comments” table. Comment numbers are unique within each issue, so we make uncontroversial decision to also add a unique key.

CREATE TABLE comments ( id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT, body TEXT NOT NULL, issue_id INTEGER NOT NULL, seq_num INTEGER NOT NULL, UNIQUE KEY (issue_id, seq_num) );

(Here we ignore two practical concerns. First, how to migrate the comments table: you would need 1/ an intermediate schema, with nullable seq_num; 2/ backfill comment numbers for existing comments; 3/ change seq_num to NOT NULL. Second, we do not discuss how to generate seq_num when inserting a new comment: this is very much database-dependent. Both questions are valid but not important for the topic at hand.)

Natural composite vs surrogate scalar primary keys

One apparently important concern in relational design is a choice between natural and surrogate primary keys.

One traditional school of thought states that natural primary keys must be used as much as possible. Natural keys are often also composite. The opposite approach would be to embrace surrogate keys everywhere. In particular, surrogate keys would tend to be scalar (non-composite).

The comments.id field does not carry any information, it only helps identify a specific comment. We use this value as foreign key in other tables, such as comment_upvotes.

But looking at the new structure of the comments table, we can see that a combination of (issue_id, seq_num) uniquely identifies every comment, and so it can be used as a primary key. Moreover, in that case storing comments.id becomes redundant.

The big question is: should it be used as a primary key?

Path 2. Second system

Now imagine that we decided to rewrite our issue tracker from scratch (or maybe we joined a different company and this is what we need to do). From our experience with the first version we know that the stable numbering of comments is a very useful feature, so we implement it from the very beginning. Looking at the same requirements, and considering different approaches to primary keys, it’s possible that we would design tables with composite primary key:

CREATE TABLE issues ( id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT, body TEXT NOT NULL );

CREATE TABLE comments ( issue_id INTEGER NOT NULL, seq_num INTEGER NOT NULL, PRIMARY KEY (issue_id, seq_num), body TEXT NOT NULL );

CREATE TABLE comment_upvotes ( issue_id INTEGER NOT NULL, seq_num INTEGER NOT NULL, user_id INTEGER NOT NULL, PRIMARY KEY (issue_id, seq_num, user_id) -- FOREIGN KEY (issue_id, seq_num) REFERENCES comments(issue_id, seq_num) );

Note that the primary key in the comments table is now composite, and it’s used in the comment_upvotes table and other tables that need to refer to comments.

Dealing with path dependence

So, two relational designs for the same set of requirements could be different, depending on the path we took.

Arguably, on every step of the way we’ve made the right decision. However, looking only at the final design, we cannot tell if it’s as intended. If we subscribe to the point of view that composite natural PKs are good, then we would have to ask why first design involves a surrogate PK.

There are two additional options for each of the two paths that lead to identical resulting schemas.

First option: we could proceed with further altering the comments table, removing the id column and making (issue_id, seq_num) a primary key instead. To do that, we would first have to alter all dependent tables, such as comment_upvotes. In a running system, this would require a non-trivial migration process. You would have to rewrite existing code, and coordinate several table migrations with code changes. For most systems this would probably never happen, because the end result does not bring a lot of additional benefits. But technically, it’s possible.

Second option: what if we do have strong preference for composite natural keys, but value a uniform approach to physical table design? Then we would always be designing physical tables incrementally, attribute by attribute. We could require that each entity has a surrogate “id” column (so, no composite PKs), and each attribute is a separate column. Also, we would just acknowledge that the existence of a unique key on some data does not necessarily mean that it needs to become a primary key.

Why do we care?

We can make three observations:

First, even if we don’t care about the entire natural/surrogate concern (but we definitely should), we still must consider the teaching aspect. We should explain cases when relational design could be different between textbook and a real-world database. Also, how much variation is acceptable?

Second, path-dependent variations in a physical model seem to be unavoidable (another example of that is discussed in https://kb.databasedesignbook.com/posts/restaurant-attributes-design/).

Third, the relational model nudges you towards using composite primary keys: they are featured prominently in the underlying theory. If it’s there then you should use it, right? Such nudging tends to invite path dependency.

Logical model

The logical model is the same in both cases; the physical part varies accordingly. Basically, path dependence is maybe the most important feature of a logical model.

For reference, here is how it looks like for this scenario:

Conclusion

The choice of primary key is an important topic in the relational model.

We show that in certain cases final table design may be different, depending on the system development path. This is important from the teaching standpoint: you have to explain the variations.

Logical models are path-independent.

P.S.: It seems that many people don’t buy the example as it is stated. I realized that a better example would be Github. We have repositories and issues; issues are numbered sequentially per-repo; you can delete issues; you certainly want a stable issue numbering. Next version of this post will incorporate this change.

Jul 3

I'm not sure if I really understood the example; both the primary key and a (missing, but certainly will be required) created_at timestamp would be excellent candidates for sorting of comments; Using a integer seq_num field is actually the worst possible implementation I can think of, as it will most certainly require reading a previous record to determine the value on every insert, and exposes sorting to applicational bugs (concurrency issues when inserting multiple comments at once, ensure the value is not negative, etc).

Expand full comment

6 replies by Alexey Makhotkin and others

6 more comments...

Minimal Modeling

Discussion about this post