Many faces of left join
I’d like to discuss the idea of left join today. (See the summary at the bottom of this email.)
LEFT JOIN operator in SQL
Traditionally SQL operators are written in upper case which is quite useful, notation-wise. We could distinguish between the general idea of left join and the LEFT JOIN operator.
A simple nonsensical example of LEFT JOIN operator:
SELECT tblA.foo, tblB.bar
FROM tblA LEFT JOIN tblB
ON tblA.quux = tblB.baz AND tblA.qux <> 4
WHERE tblA.quuux > 20000;
This is a really bad example (its meaning is not clear) but it shows one aspect of SQL: the SQL server will generally do exactly what you tell it to do, if you got the syntax right. The query execution plan may be horrible, and the results may be nonsensical, but you will get what you asked for (even if you did not fully understand your own question).
For the purposes of this discussion we’re going to restrict the LEFT JOIN. Suppose we have a noun (or entity, or anchor); it has an ID (also known as primary key). When we need to store new data attribute, we have essentially three choices: a) add a field to the main table; b) add a field to the side-table that holds a number of fields, usually under a common theme; and c) use single-attribute narrow tables.
So, we’re going to use the following restrictions:
join only the tables related to a single noun, e.g. “user”;
the ON clause of LEFT JOIN operator is going to be strict ID equivalence, such as ON users.id = users_pii.id.
in the WHERE clause we’re going to use only WHERE id IN (…) construct, and no other conditions.
Suppose that we have a table called “users” with the following fields: (id, loginname, password) and a table called “users_pii” that contains personal data about the users, such as: (id, real_name, year_of_birth). Some users did not provide their personal data, so there is no row in users_pii. The id field in both tables is declared as primary key: this prevents duplication of values, and in every real-world database it means that fetching the row corresponding to a certain ID is essentially one of the most efficient operations imaginable.
To get the login name and real name of users with IDs 2, 3, and 5, we’re going to use the following SQL statement:
SELECT users.id, users.loginname, users_pii.real_name
FROM users LEFT JOIN users_pii ON users.id = users_pii.id
WHERE users.id IN (2, 3, 5);
It could return the following dataset:
id loginname real_name
-- --------- ---------
2 vasya Vasily V.
3 kolya NULL
5 anya Anna A.
We can see that user kolya did not provide their real name, so there is no corresponding row in users_pii table, so the NULL value is used instead.
So far it’s all mostly trivial, we just need to set up the scene.
LEFT JOIN performance
SQL joins seems to be one of the software development topics that has accumulated a lot of history over the decades. SQL joins invoke a dozen of different contexts: 1/ relational database theory, 2/ the role of joins in practical SQL queries, 3/ how SQL joins are taught, 4/ the enormous engineering effort that went into typical SQL server optimizations of joins over the years, 5/ hardware details such as reading from disk or traversing memory arrays, 6/ algorithms used to implement typical database indexes such as B-Tree, 7/ multiple overlapping mental models of people who started their careers in different half-decades and now pass those mental models to younger people who also started their careers in different half-decades, 8/ the capabilities of commonly-used ORMs and query builders, etc., etc., etc.
In practice it means that some people tend to be join-averse and they are not afraid to show it.
For the purposes of this discussion we ignore the question of “where did we get that list of IDs?”. For example, we could get it from the query such as “get IDs of users who were born in 1990 and have the name starting with the letter ‘c’”; or we can get the IDs of all users who liked a certain post, or whatever. We’ll discuss this kind of searching later, for now we just assume that we have a list of IDs.
With the restrictions that we agreed on, the LEFT JOIN queries are among the most optimizable in every database server.
The key characteristic of primary key lookups is not that they’re fast or slow, as sometimes mentioned, but that they have low variability of processing time. Primary key lookup in practice is O(1), partly because each ID is unique and so it has a maximal cardinality. Let’s also note the fact that low variability of execution time is good for scalability.
JOIN-free queries
There is one very good reason to not use JOINs: your data could reside on different database servers, in different databases, on different hosts. Suppose that our users table is in a MySQL instance, and the users_pii table is in an appropriately secured PostgreSQL.
In this case we cannot use LEFT JOIN the SQL operator, and we must do client-side join. We send one query to MySQL:
SELECT users.id, users.loginname
FROM users
WHERE users.id IN (2, 3, 5)
And another one to PostgreSQL:
SELECT users_pii.id, users_pii.real_name
FROM users_pii
WHERE users.id IN (2, 3, 5)
The first query is going to return the following dataset:
id loginname
-- ---------
2 vasya
3 kolya
5 anya
And the second query is going to return:
id real_name
-- ---------
2 Vasily V.
5 Anna A.
(Note that there is no row for the id=3).
Now we can post-process those two datasets using our favourite programming language, and get exactly the same dataset as before. The work of constructing it is going to happen not on the database servers, but in our application. We may even decide not to bother with combining both datasets, but just use both directly. For example, we can convert them to a kind of hashtable or dictionary that maps id to the attribute value.
The performance of this approach is going to be roughly similar to the SQL version in terms of low variability: both database servers execute primary key lookups, and our client application inserts each row in a hashtable, which is also basically a O(1) operation.
Single-server JOIN-free query
Let’s go back to the situation where both tables are in the same database. We could get rid of LEFT JOIN even in this case, of course. Most SQL servers can execute multi-part SQL queries: you just provide semicolon-separated queries, and it returns a sequence of datasets in the same order.
This feature is not often supported in popular ORMs which is a shame. Each SQL query requires to a) send the SQL statement to the server, b) wait until it processes the query, c) receive results. After the results are received, you can send another query. This roundtrip takes some time. If you can send multiple queries and receive multiple results you can get a performance improvement by reducing the latency.
Multi-part queries are often supported by low-level drivers. So, we can send both queries, separated by a semicolon:
SELECT users.id, users.loginname FROM users WHERE users.id IN (2, 3, 5);
SELECT users_pii.id, users_pii.real_name FROM users_pii WHERE users.id IN (2, 3, 5);
And we receive the two-part dataset:
id loginname
-- ---------
2 vasya
3 kolya
5 anya
==============
id real_name
-- ---------
2 Vasily V.
5 Anna A.
==============
Parallel execution of queries
Now that we mentioned the query latency, we can go back to multi-server queries and note that we can send both queries (to MySQL and to PostgreSQL) simultaneously, then wait for both responses, so it will only take as much time as the slower query: that would be a win!
This is true even for single-database queries. If we have multiple replicas (which is almost always the case), we can send both queries to different replicas, wait for both results and save some time too.
This would require us to have a programming environment that allows us to reliably make such parallel queries, but it’s not a big problem nowadays. The performance gains may be such that this additional complexity would pay off (or we could even hide a lot of this complexity behind some kind of facade).
Rewriting queries
We agreed to restrict the syntax of LEFT JOIN. That means that we can now easily parse it programmatically, and we can transform it to something else. Let’s look at the original SQL query again:
SELECT users.id, users.loginname, users_pii.real_name
FROM users LEFT JOIN users_pii ON users.id = users_pii.id
WHERE users.id IN (2, 3, 5);
We may notice that the query does not really specify where exactly the users and users_pii tables live. For example, they could live on different servers, or on different replicas or shards. Our application expects a result dataset of a certain form, but it does not care how this dataset was constructed. We can mechanically rewrite this query as two queries, send them in parallel to two servers or replicas, receive both datasets and post-process it to the usual tabular form. All of this would happen behind the scenes.
Data dependency
In the beginning it seemed like the LEFT JOIN operator tightly couples two tables: querying a single table is perceived as a lightweight operation, while introducing JOIN makes the query heavier.
But now we can see that the restricted LEFT JOIN actually provides a reliable opportunity for decoupling the work of processing both tables. You can process them both simultaneously, with or without the help of a database server. You can move tables between server instances. You can move fields between tables (we’re going to discuss the data migration later). There are even more unusual possibilities, that we’ll discuss in later posts.
Outcomes of this discussion
There is nothing groundbreaking here, as usual. But I think that it may be a useful foundation for our future discussions. Here is what we learned so far:
Unrestricted LEFT JOIN SQL operator is too complex and confusing;
Restricted version of LEFT JOIN is useful for the particular use case that we’re interested in, and is much easier to reason about;
The key performance characteristic of ID-based lookups is a low variability of response time;
There is a lot of accumulated wisdom and prejudice related to SQL joins;
Sometimes you can’t use LEFT JOIN operator directly (e.g. multiple servers), but you can always use the general idea of left join);
Restricted LEFT JOIN is a powerful way to decouple data processing;
Multi-part SQL queries help reduce latency;
Multi-part SQL queries should be more widely used;
SQL is more high level than the underlying database;
In the following post we can continue this discussion, investigating the spectrum between wide and narrow tables. Then we can talk about the other composability dimension: UNION ALL operator.
P.S.: follow me on Twitter: https://twitter.com/alexeymakhotkin.