Game theory in database design
Game of tables
Let’s talk about the game of database tables. The rules are quite simple. First, we have a noun, (or entity, or anchor): things like User, Order, Post, Invoice, etc., etc.
Second, nouns can have attributes that contain data. For example, the data about the User can include username, real name, date of birth, avatar image, password, “prefers dark mode” setting, and so on.
Attributes could be stored in physical tables, and there are several acceptable ways to do that.
A wild new attribute appears! Your goal is to find a place for it in the database.
Possible moves
There is always a main table for each noun, and it’s usually called something like users. You can more or less freely create new tables and add attributes to one of the existing tables.
So, we can imagine the following list of possible moves:
add a field in the main table;
add a field in a side-table that groups several fields, usually by some kind of topic;
add a new single-attribute table that contains two fields: id and the attribute value;
use one of the key-value tables associated to this noun (sometimes this approach is called “schema-less”);
use free-form JSON-style storage (this is a variation of the previous item).
There are probably one or two more exotic ways to store data, which I don’t remember at the moment, but no more than that. So we have five to seven possible moves, which is a very good number: it’s easy to keep the entire list in your head while playing this game, and it’s quite easy to teach it to the new people).
How do you choose a move?
There are several concerns that you may consider when you choose one of the possible moves to play:
how hard it would be to add a new field to the main table or the side-table, considering the amount of data and server implementation;
is there a “good” topic-based side-table that feels natural as the new home for this attribute? If there is not, can we find a good new topic for this and related attributes, e.g. “payments stuff”?
do we envision some kind of typical queries that would be used with this attribute? If those queries are performance-sensitive, we may want to put it closer to other attributes that would be used in those typical queries. (But of course, we cannot know beforehand how the querying requirements will evolve over time).
do we have any specific performance expectations from the already existing tables? For example, if we have a not-too-wide table, the data is spread out quite efficiently on disk; if we add a wide attribute, the data density may drop and the existing queries will be executed with a different performance profile;
do we have specific requirements for the new attribute? For example, if it is something like “real name of the user” we may want (or be legally required to) to store it in a special PII-aware database. Another example is avatars: binary files that we may want to store in S3-style storage.
does this attribute feel more important? In the DevOps world the servers are classified as “pets” and “cattle”: each pet-style server is carefully maintained by system administrators, while cattle-type servers could be provisioned and deprovisioned by hundreds and thousands. It seems that some data attributes feel more like pets, and some more like cattle. For example, all kinds of low-importance string attributes could all live together in a single key-value table.
How many attributes does this noun have already? We may have one modeling strategy at the time when there are only a few initial attributes (or we just started designing the main table), compared to a different modeling strategy when there are hundreds of attributes already.
What does the company ecosystem look like? Are there any modeling guidelines, expectations, written and unwritten requirements and traditions, will you have to defend the proposed schema change?
The more experience you have with the table modeling, the more concerns you would have in this list. We can relatively easily add a dozen more items here, but even this list would be enough for now.
The point is that a solution to the combination of those concerns would always be a single move from half a dozen allowed moves.
Poker analogy
Software development community likes to call it trade-offs, but from a certain angle it looks more like a game of poker. Let’s take Texas Hold’em as an example. During the betting round we have four allowed moves: check, call, raise and fold. But the number of “concerns” that we have is much, much higher: from the cards that you have to the number of chips, from the number of people at the table to the level of the tournament. Psychological aspects are also important: how tired you are, are you on tilt, how much did you drink, your general mood, the facial expressions of your opponents, etc., etc.
The more experience you have with poker playing, the more considerations you would have in this list.
But again, when it’s your turn to play, the ultimate solution to the combination of those concerns would always be check, call, raise or fold. You can’t not play.
Back to the tables
So, we have relatively many concerns and relatively few possible moves. In database modeling, thankfully, our moves are not final, compared to the poker game: we have ways to change the physical representation of the database (this, by the way, is one of the concerns that you could take into account).
If we focus on each move that we make, we see that it emerges from intertwined concerns that guide us towards one of them. How can we decide that someone’s move was suboptimal? We can point out one or more miscalculated or ignored concerns; but it’s also possible that we’ve miscalculated or ignored (or did not know about) a different concern. We may be in a different phase of the project than the person who made the move, we may have a different workload or a mental or emotional state. Our hindsight may be 20/20.
The funny thing is that we may not know anything at all about the database modeling concerns: a parrot has one in a six chance of finding a good place for the attribute. The winner takes it all: if somebody agrees with your decision about the table design, they may never find out that you did it for the reasons that they would never have agreed with!
The point of this discussion is that investigating concerns is more important, and it’s actionable. If, on the contrary, you think about the possible moves then the main tool that you have is counterfactual analysis: what would have happened if we made a different move? Would it be better or worse than what we have now?
It’s easier to think in terms of direct decisions than in terms of counterfactuals. It’s easier to have a direct checklist or a decision tree that helps you to choose the optimal technical decision.
One of the goals of this substack is to thoroughly investigate the concerns. But there is yet a couple of more abstract topics that I’d like to discuss in the following posts.
P.S.: follow me on Twitter: https://twitter.com/alexeymakhotkin.