Left joins, pt. II
In the previous issue we began discussion of left joins. Today I’d like to talk about the spectrum between narrow and wide tables, and how both kinds of tables are connected via left join.
So, the fundamental question of database modeling is where to put the field for the new attribute. There are three possible decisions:
Use the main table for the noun (or entity, or anchor) in question;
Use a side table that keeps several fields together, usually grouped by topic;
Create a single-attribute table with two fields: id and attribute value.
Most databases that you encounter in the wild use the main or side table approach. Most often people begin with the main table and add fields until it begins to crack under strain, then switch to side tables. The resulting distribution of fields between the tables is best explained by the Game of Tables.
Pure main table approach and single-attribute tables approach are rare, for two different reasons. Putting all the fields in the main table will sooner or later become too inconvenient (but there are reports of actual schemas with 500+ column tables). Single-attribute tables are the victims of traditional join-aversion, described in the previous issue. However, both of those approaches in their pure form allow us to see how left join works.
We’ll again use a simple example: the noun/entity/anchor is User. It has the following attributes: loginname, real_name, year_of_birth. There is also a special ID attribute.
Views: Joining single-attribute tables
So, we create a table for each attribute:
CREATE TABLE users_loginname (
id INTEGER NOT NULL PRIMARY KEY,
loginname VARCHAR(32) NOT NULL);
CREATE TABLE users_real_name (
id INTEGER NOT NULL PRIMARY KEY,
real_name VARCHAR(128) NOT NULL);
CREATE TABLE users_year_of_birth (
id INTEGER NOT NULL PRIMARY KEY,
year_of_birth INTEGER NOT NULL);
The tables are completely regular, they use the same name pattern, they differ only by the data type of the attribute value (string, integer, etc). Creating and dropping such tables could be fully automated, we could implement transparent support of such tables in our data access layer, etc., etc.
There is one interesting question: where do we keep the ID of users? Traditionally we have something like an automatically incrementing integer counter that allows us to have a number corresponding to each user. That number is pretty meaningless by itself, but it’s used to keep track of links between users and other nouns in our system, such as payments, order, posts, etc.
Concatenability principle demands that we should be able to drop every attribute that we’ve created. It means that we cannot use any of the tables defined above: we can imagine a situation where this attribute is no longer needed and it’s even undesirable to store.
Also, we can imagine a situation when we know that the noun is needed (there are going to be users in our system, surely), but we did not yet agree which of the attributes we’re going to use. This strongly suggest that there should be a separate table for storing the ID attribute:
CREATE TABLE users (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY
);
There are no more fields in this table, and this table is not going to change any more. Its only purpose is to keep track of user ids.
Such a design looks weird from the point of view of traditional database modeling, but it seems to be the only solution under the constraints that we’ve agreed on in this section. There are many interesting things to say about the IDs, but this is going to be the topic for one of the future posts.
So, we now have four tables. As software developers, we want to query our tables with ad-hoc queries to quickly debug system issues, explore the data, generate quick statistical reports etc. In the traditionally-designed table we’d have a single table called users with five fields: id and three attributes. Then we’d be running our “SELECT * FROM users WHERE …” queries to our heart’s content. In this schema we would have to write multiple LEFT JOIN operators which is incredibly tedious.
Developer experience is always an important concern in database modeling, and we can actually make life easier for humans. We can create something called “view”, which is a virtual table with the corresponding query. When we query the view, the underlying query is executed instead (we can also use the usual SQL operators such as WHERE, the database server is going to handle everything behind the scenes).
So, let’s create a view (called v_users) that returns exactly the same dataset as if we used the traditional wide table:
CREATE VIEW v_users AS
SELECT users.id, users_loginname.loginname, users_real_name.real_name, users_year_of_birth.year_of_birth
FROM users
LEFT JOIN users_loginname ON users.id = users_loginname.id
LEFT JOIN users_real_name ON users.id = users_real_name.id
LEFT JOIN users_year_of_birth ON users.id = users_year_of_birth.id;
The definition of this query may be changed when we add or remove attributes. We can query this view, for example:
> SELECT * FROM v_users WHERE loginname = ‘vasya’;
id | loginname | real_name | year_of_birth |
4 | vasya | Vasily V. | NULL |
The modern database servers should do a pretty good job optimizing this kind of queries. Because the names of the tables are completely regular, we can easily auto-generate the query behind the view. Also, this kind of views is really only needed for the people and their ad-hoc queries: the actual code can work directly with the single-attribute tables.
Anyway, the point is that we’ve managed to make a huge leap from one side of the spectrum, single attribute tables, to the other side of the spectrum, wide tables. If you squint a bit you can even miss the fact that v_users table does not, in fact, exist.
Even more: views could be materialized! The database would create an actual table that would contain a copy of data from all the single-attribute tables. When the tables will change, the database will automatically update the materialized view. Querying such a view is going to be exactly as performant as if it was a normal wide table.
Few years ago this was an important moment of understanding for me personally: the ordinary tables are actually just materialized left joins of attribute values.
Views: Extracting attributes from wider tables
Now let’s look at this from the other side of the spectrum. Let’s begin with the normal users table, with a completely predictable and boring structure:
CREATE TABLE users (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
loginname VARCHAR(32),
real_name VARCHAR(100),
year_of_birth INTEGER
);
We may represent this wide table as four virtual views, each corresponding to an attribute. For example,
CREATE VIEW v_user_ids AS
SELECT id FROM users;
CREATE VIEW v_user_loginname AS
SELECT id, loginname FROM users;
And so on. Again, we can use those views in our queries freely. If we don’t look too close we would never know how our tables are actually represented physically: as wide tables or as a handful of single-attribute tables.
So, here we made a (much easier) jump from the narrow side to the wide side of the table design spectrum.
This exercise is useful because it shows how conceptually flexible the database tables actually are. In our opinion, one of the problems of practical database design is that they are perceived as much more rigid as they really are. One of our goals is to loosen up this perception even more, deconstructing it from several sides.
In the next post we’ll try to talk about the natural counterpart of left join: UNION ALL.
P.S.: follow me on Twitter: https://twitter.com/alexeymakhotkin.