VERTICAL JOIN, pt. I
In the previous posts we’ve discussed the idea of left join. Today I’d like to talk about its counterpart: the UNION ALL operator.
UNION ALL is used much less in practice than the LEFT JOIN operator. However, the concept is important, and some use cases illuminate interesting ideas. In this post and the next one we’re going to discuss: constant datasets, some notation issues, vertical joins, sharding, and polymorphic relations. The idea of concatenability would also be relevant to this discussion.
Syntax and notation
UNION ALL operator allows to combine two or more compatible independent queries. The syntax looks like this:
SELECT <query1> UNION ALL SELECT <query2> …;
Two queries are compatible if their results have the same number of columns, and the columns have the same data type. The resulting dataset of UNION ALL query consists of the result of the first SELECT query, concatenated with the result of the second SELECT query.
In standard SQL there is a variation of UNION ALL operator, called just the “UNION”. It works similarly to UNION ALL, but it additionally makes sure that if there are duplicate rows in both subqueries, the duplicates will be removed (so only unique rows are present in the output).
Of course, the database server must do additional work, and UNION query is generally going to be less performant than the UNION ALL query. In many cases you don’t really need this uniqueness requirement: for example, it is satisfied by some other means, or your code is going to handle duplicate rows gracefully. In some cases you can get additional speed if you replace UNION with UNION ALL.
If I would be able to magically improve SQL, this is one of the things that I’d do: make UNION simply concatenative by default, and make a separate UNION DISTINCT that would return unique rows. This makes sense from the affordance perspective: heavier operations should have longer names, and vice versa.
Every programming language has some syntax for the constant values of commonly used data types, such as numbers (e.g. 36.6) or strings (e.g. “hello world”). Sometimes other data types are supported, such as booleans (e.g. true). SQL also allows constant syntax for tuples (e.g., (2, “foo”), which is useful for some queries). But what if we want to specify a constant dataset, for example:
| colA | colB | --------------- | 2 | foo | | 5 | quux | | 7 | gvar | ---------------
Suppose that you have a completely empty database, without any tables, yet you want to construct a query that would return exactly this dataset.
There are some practical application of constant datasets: for example, they could be used for batch-updating a lot of data (we’re going to discuss this later). But for now we’re primarily interested in them because we want to better understand the algebra of queries. LEFT JOIN allowed us to add more columns to the existing dataset; UNION ALL allows us to add more rows.
Here is how to generate the constant dataset shown above:
SELECT 2 AS colA, "foo" AS colB UNION ALL SELECT 5 AS colA, "quux" AS colB UNION ALL SELECT 7 AS colA, "gvar" AS colB;
And the result would be, just as expected:
+------+------+ | colA | colB | +------+------+ | 2 | foo | | 5 | quux | | 7 | gvar | +------+------+
Of course, we’re not going to construct this kind of queries manually: they’re going to be auto-generated.
Note that the SQL query text does not reference any tables. Some database servers do not allow this, but they have a special pseudo-table for that, traditionally called DUAL. So, instead of “SELECT 1;'“ you may have to write “SELECT 1 FROM DUAL;”.
In the “Many faces of left join” post, “JOIN-free queries” chapter we showed that you can reimplement a query containing the LEFT JOIN operator as two separate queries and postprocess their results. We can notice that the same is true for the queries that use the UNION ALL operator.
That’s why UNION ALL is not often used in practice: often you can just make several queries and post-process their results. For example, suppose that you want to find a list of active users. We define “active user” as the user who either posted something, or liked something, or commented some post (hmmmm, there is an “either/or/or” pattern here, could this mean something? Let’s discuss it later.)
To find the list of IDs of such users we could use the following UNION ALL query:
SELECT DISTINCT user_id FROM posts UNION ALL SELECT DISTINCT user_id FROM comments UNION ALL SELECT DISTINCT user_id FROM liked_posts;
Some ids would be repeating, but our code would most certainly be able to ignore them.
However, the same task could be implemented by directly executing those three subqueries, separated by semicolons, and combining their results in the code. Just like for the JOIN-free queries, we can immediately see that those three queries could be executed in parallel, maybe even using different database replicas for speed. We can thus substantially reduce the query latency.
We could say that LEFT JOIN both combines different columns, and provides a way to cleanly split processing of independent attributes. Similarly, UNION ALL both combines different rows, and provides a way to cleanly split processing of independent datasets.
That’s why we could say, tongue-in-cheek, that UNION ALL really could better be renamed as VERTICAL JOIN.
Usually the posts in this substack have about 1500 words. As an experiment, this post was about 50% shorter. In the next issue, we’re going to discuss:
More on improving the SQL notation;
Polymorphic relations. This is a huge topic that will be discussed separately in the same vein as the either/or saga, but we’ll begin with the role that UNION ALL plays here.
P.S.: follow me on Twitter: https://twitter.com/alexeymakhotkin.
Hold my beer:
select * from (values(2, 'foo'), (5, 'quux'), (7, 'gvar')) tmp(colA, colB);
Valid in SqlServer and PostgreSQL. SQLite won't let you name your columns, but you can get there via a CTE, or omit them entirely if you don't care (they'll be named column1, column2).
I don't know about other databases.