Since publishing “Systematic design of multi-join GROUP BY queries”, I’ve been thinking about a number of follow-up questions:
Why isn’t this approach common knowledge, really?
Is there any canonical text explaining all of that, so that I could replace this post with a link to that canonical text?
Why do some LLMs make the same mistake as humans do?
Why do some other LLMs solve the problem correctly from the first try?
How can we better teach this to both people and LLMs?
Let’s investigate this, starting from the simplest case of one aggregate column. We’re going to use the same “social network analytics” problem as explained in the beginning of the original post, see the link above. We’ll use the schema given at the end of the original post.
(See the last section below, “The contribution of this work”, for a summary of this post.)
Base case: one aggregate column
So, let’s implement a trivial query: a list of user IDs, accompanied by the number of posts they have made. Note that we want to include the IDs of users who did not post anything.
Direct table join approach
Here is the query that many people, and some LLMs, would probably write.
SELECT users.id, COUNT(posts.id)
FROM users LEFT JOIN posts ON users.id = posts.user_id
GROUP BY users.id
This query does what we need: it’s efficient, and it automatically handles users without posts. Some things to note, though.
Note the “COUNT(posts.id)” fragment. We should be careful not to accidentally use “COUNT(*)”. The former counts the number of non-NULL post IDs; for users without posts this number would be 0. The latter counts the number of rows produced for each user ID; for users without posts this number would be 1.
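A quick way to see the difference is to compute both counts side by side; this is just a sketch against the same schema, with illustrative column aliases:
SELECT users.id,
  COUNT(*) AS joined_rows,       -- 1 even for a user without posts
  COUNT(posts.id) AS posts_count -- 0 for a user without posts
FROM users LEFT JOIN posts ON users.id = posts.user_id
GROUP BY users.id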
We will see shortly that this query actually contains a potential bug. It doesn’t trigger in the base case, but it would trigger when we try to add one more table and an aggregated value.
Query processing estimate
Let’s look at the underlying conceptual dataset of this query:
SELECT users.id AS user_id, posts.id AS post_id
FROM users LEFT JOIN posts ON users.id = posts.user_id
What do we mean by “conceptual”? For the database server, the most straightforward way to execute the original query is to generate the underlying dataset and then process it by grouping and calculating the number of posts. Real query engines may find opportunities for optimization and compute the result faster. But it’s important to understand the upper bound of the compute effort required for such queries.
In the test database that I am using for experiments there are 62 users and 54573 posts. The size of the underlying conceptual dataset for the base case query is slightly larger: 54601 rows. This is because there are 28 users without posts, each of them adds one output row to the result of the LEFT JOIN.
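If you want to double-check these numbers on your own copy of the schema, a quick sanity check could look like this (a sketch; the exact counts will of course differ):
-- size of the conceptual dataset (54601 rows in my test database)
SELECT COUNT(*)
FROM users LEFT JOIN posts ON users.id = posts.user_id;
-- users without posts; each contributes exactly one row to the LEFT JOIN (28 here)
SELECT COUNT(*)
FROM users
WHERE NOT EXISTS (SELECT 1 FROM posts WHERE posts.user_id = users.id);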
Note that we intentionally do not discuss the execution plan for this query, because we want to keep this text more or less database-independent. We can expect modern databases to handle such simple queries in a roughly similar fashion.
If you do look at the query plan, there is nothing surprising there. The tables are well-designed, with all the necessary indexes in place. The question is how well the queries are designed!
Alternative query design: CTEs
Even though for this trivial case it’s not necessary, let’s try and design another query, using the approach suggested in the original post.
The unique key is “user_id”. The unique key subquery is trivial:
SELECT users.id AS user_id
FROM users
The aggregate subquery is also trivial:
SELECT posts.user_id, COUNT(*)
FROM posts
GROUP BY user_id
Now, let’s combine those two subqueries using a LEFT JOIN. Note that we join the entire subqueries, and not the tables!
WITH user_ids AS ( -- unique key subquery
SELECT users.id AS user_id
FROM users
),
posts_count AS ( -- aggregate subquery #1
SELECT posts.user_id, COUNT(*) AS posts_count
FROM posts
GROUP BY user_id
)
SELECT user_ids.user_id, COALESCE(posts_count, 0)
FROM user_ids
LEFT JOIN posts_count ON user_ids.user_id = posts_count.user_id
The query runs roughly as fast as the direct table join. The query plan does not present any surprises here (it introduces a temporary table).
One thing to note is that we had to use the “COALESCE” function. It is needed because otherwise we’d get NULL instead of 0 for users without posts. This is completely logical if you think about how LEFT JOIN works on two subquery results.
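Here is a minimal, self-contained illustration of why COALESCE is needed, with made-up values and PostgreSQL-flavored VALUES syntax (other databases may need the literal rows spelled out differently):
WITH user_ids(user_id) AS (VALUES (17), (23)),
posts_count(user_id, posts_count) AS (VALUES (17, 42))
SELECT user_ids.user_id,
  posts_count.posts_count,             -- NULL for user 23, who has no posts
  COALESCE(posts_count.posts_count, 0) -- 0 for user 23
FROM user_ids
LEFT JOIN posts_count ON user_ids.user_id = posts_count.user_id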
Another thing to note is that we used “COUNT(*)” in the aggregate subquery; we didn’t have to use “COUNT(id)”, although we could have. Is it possible that there is a potential bug lurking here, the same as in our first query design approach?
Finally, let’s estimate the query processing effort:
the unique subquery requires processing 62 rows;
the aggregation subquery processes 54573 rows; it generates an intermediate dataset that contains a maximum of 62 rows;
the LEFT JOIN between the subqueries is trivial (1:1, one ID corresponds to one ID), so it processes 62 rows.
We can see that the estimated effort for this query design is roughly the same as for the direct table join.
Alternative query design: correlated subqueries
Let’s look at a slightly different query design.
SELECT users.id,
(SELECT COUNT(*) FROM posts WHERE posts.user_id = users.id) AS posts_count
FROM users;
Wow, it looks surprisingly good. It’s very concise: no GROUP BY, no need for COALESCE. The execution plan looks great.
Let’s estimate the processing effort:
the main query requires processing 62 rows;
conceptually speaking, we could expect the subquery to run once for each of those rows, 62 times total. Each run would process a subset of the “posts” table, but together they would process all 54573 rows.
Jumping ahead, we’ll see that modern databases don’t actually run the subquery 62 times. Instead, a rough equivalent of LEFT JOIN with GROUP BY is executed. We’ll discuss it later.
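If you are curious what that rough equivalent might look like, here is a sketch of the de-correlated shape (not an actual execution plan, just the general idea; the alias pc is arbitrary):
SELECT users.id, COALESCE(pc.posts_count, 0) AS posts_count
FROM users
LEFT JOIN (
  SELECT posts.user_id, COUNT(*) AS posts_count
  FROM posts
  GROUP BY user_id
) AS pc ON pc.user_id = users.id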
Overall we see that this approach has roughly the same characteristics as the CTE-based one. However, we won’t be using this query design as the general approach to multi-join queries; we’ll explain why later.
This design will also provide an interesting insight later on, one that helps our investigation of the teaching aspects.
Extending the query
Now let’s see how all three query designs handle adding one more aggregated column. Let’s add not just another count, but something different: a timestamp of the most recent comment that the user posted.
Direct table join
Adding one more table, we get:
SELECT users.id, COUNT(DISTINCT posts.id), MAX(comments.created_at)
FROM users LEFT JOIN posts ON users.id = posts.user_id
LEFT JOIN comments ON users.id = comments.user_id
GROUP BY users.id
Problem 1: row multiplication. We get a nasty surprise running this query: it takes 12 minutes to return the result (compared to the 0.05 seconds it took to return the one-column result, about 14500 times slower). Why? Our database is still quite small!
Let’s look at the underlying conceptual dataset of this query:
SELECT users.id AS user_id, posts.id AS post_id, comments.created_at
FROM users
LEFT JOIN posts ON users.id = posts.user_id
LEFT JOIN comments ON users.id = comments.user_id
Thinking about how two LEFT JOINs would work here, we deduce that for each user (62 of them), it’s going to process roughly the number of posts by that user times the number of comments by that user. Specifically, for the current contents of my test database the underlying dataset would contain 428 million rows.
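This number can be estimated without materializing the dataset, by multiplying per-user post and comment counts; here is a sketch (COALESCE and GREATEST account for users with no posts or no comments, who still contribute at least one row each):
SELECT SUM(
  GREATEST(COALESCE(p.cnt, 0), 1) * GREATEST(COALESCE(c.cnt, 0), 1)
) AS conceptual_rows
FROM users
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM posts GROUP BY user_id) AS p
  ON p.user_id = users.id
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM comments GROUP BY user_id) AS c
  ON c.user_id = users.id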
In a sense, we got lucky: our database is quite small, but the data distribution is such that we immediately stumbled into problems due to Cartesian multiplication. If the data distribution had been different, our query might have run slower, but reasonably so, and we would have thought that our query was fine.
Also, our query would run fine on tiny datasets, say 100 users with 10 posts and 10 comments each. We’d have just 10000 rows in the underlying dataset, and wouldn’t notice anything wrong. Basically, if you present such a query in a textbook, it’s possible that you wouldn’t realize how quickly the underlying dataset can grow.
It’s also possible that, due to a particular data distribution, changing the order in which the aggregated tables are joined would cause drastic changes in the number of rows in the underlying dataset. LEFT JOIN is asymmetric, so it cannot be optimized by automatically choosing a different join order.
Problem 1b: query execution plan does not help. A fascinating aspect of this query is that its execution plan looks completely innocent (at least in the database that we use). Nowhere do you see the Cartesian multiplication aspect mentioned, and you don’t see a 428 million rows count, or anything close to it.
Problem 2: overcounting. If you look again at the query, you may notice that we used “COUNT(DISTINCT posts.id)”, and not just “COUNT(posts.id)”. We already discussed that in the previous section.
Suppose that our database has just one user, who has 10 posts and 20 comments. The underlying dataset for this specific case would have 200 rows, where each post ID is mentioned 20 times. If we used “COUNT(posts.id)”, it would return 200, which is incorrect. So we have to add DISTINCT.
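You can see all three flavors of COUNT diverge by adding them to the same query; for this single-user example the columns would read 200, 200 and 10 (a sketch with illustrative aliases):
SELECT users.id,
  COUNT(*) AS joined_rows,                -- 200: size of the underlying dataset
  COUNT(posts.id) AS posts_overcounted,   -- 200: each post ID counted 20 times
  COUNT(DISTINCT posts.id) AS posts_count -- 10: the correct answer
FROM users
LEFT JOIN posts ON users.id = posts.user_id
LEFT JOIN comments ON users.id = comments.user_id
GROUP BY users.id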
Why didn’t we use DISTINCT in the previous, one-aggregation query? Well, thinking about it, we could say that we actually had a latent bug in the previous query, and this bug manifests here, once another joined table is added. So, maybe we should learn to always use “COUNT(DISTINCT some_id)”? Are there any cases where the other two ways to count are even needed?
CTE-based query design
Adding another aggregate subquery, we get:
WITH user_ids AS ( -- unique key subquery
SELECT users.id AS user_id
FROM users
),
posts_count AS ( -- aggregate subquery #1
SELECT posts.user_id, COUNT(*) AS posts_count
FROM posts
GROUP BY user_id
),
comments_created_at AS ( -- aggregate subquery #2
SELECT comments.user_id, MAX(created_at) AS comments_created_at
FROM comments
GROUP BY user_id
)
SELECT user_ids.user_id, COALESCE(posts_count, 0), comments_created_at
FROM user_ids
LEFT JOIN posts_count ON user_ids.user_id = posts_count.user_id
LEFT JOIN comments_created_at ON user_ids.user_id = comments_created_at.user_id
Much better! This query runs slower than a one-column CTE-based one, but reasonably slower (say, 0.35 sec vs 0.03 sec).
Let’s estimate the query processing effort:
the unique subquery requires processing 62 rows;
first aggregation subquery processes 54573 rows; it generates an intermediate dataset that contains a maximum of 62 rows;
second aggregation subquery processes 203660 rows; it generates an intermediate dataset that contains a maximum of 62 rows;
both LEFT JOINs between the subqueries are trivial (1:1, one ID corresponds to one ID), so each processes 62 rows.
Overall, it looks very plausible.
The CTE-based query is certainly verbose: 18 lines vs. 4 lines for the direct join-based query. But, ugh, the latter does not work for us, so verbosity may not be a good argument.
Correlated subqueries
Let’s extend the query using correlated subqueries:
SELECT users.id,
(SELECT COUNT(*) FROM posts WHERE posts.user_id = users.id) AS posts_count,
(SELECT MAX(created_at) FROM comments WHERE comments.user_id = users.id) AS comments_created_at
FROM users;
This one is really good: it’s even faster than the CTE-based query. (Of course, that may not always be the case.) The query processing estimate is roughly the same as for the CTE-based query.
Full-size query
Our original query has six aggregated columns, using five tables, some of them twice.
It’s clear that adding more tables to the direct table join query is hopeless: it will never get better, but it can quickly become much, much worse. We cannot control this because it completely depends on the data distribution pattern.
The full text of the CTE-based query can be found in the original post (https://kb.databasedesignbook.com/posts/systematic-design-of-join-queries/#joining-ctes). It runs in less than a second, which is a very reasonable time, given that it calculates six aggregated values, and also does filtering and sorting. Here is a breakdown of table sizes:
users: 62;
posts: 54573;
comments: 203660;
likes: 171063 (joined with posts, remember);
subscriptions: 78 (this table is processed twice; for testing, we temporarily change the condition to “> 1” so that it would generate some results);
timelines: 702 (joined with subscriptions, processed twice).
I’d expect the slowdown to be logarithmic as the number of rows in the aggregated tables increases, and linear as new aggregated columns are added. Of course, it should also grow linearly as the number of users increases.
In other words, this query design is scalable.
At the moment I’m not running an experiment with correlated subqueries. I expect them to have the same performance characteristics as the CTE-based ones.
Interrogating LLMs
Our overarching goal is to help people write more powerful SQL queries. I believe that this approach to building analytical SQL queries is not common knowledge.
It’s entirely possible that I am simply not aware of some source text where all of this is explained in a similar way. If you know any such text, please share the link (or bibliographic information).
So, let’s discuss why it’s not common knowledge. LLMs are perfect in that regard, because you can ask them a question (“generate an SQL query”), and then ask why this particular approach was used. You can treat an LLM as an “average student”.
I’m going to mention some well-known LLMs by name, but I intentionally omit details like which specific model was used, and which specific prompt was given. This is not a tutorial on generating SQL using LLMs. On the contrary, we want our prompts to be “naive”.
So, we can give any modern LLM an SQL schema definition (see the end of the original post), and ask it to generate an SQL query that returns the list of columns from the “Main example” section. Here are the responses:
ChatGPT generates direct table joins by default;
Claude generates CTE-based query;
DeepSeek generates direct table joins;
The question is basically why they “don’t know” that direct table joins are so unreliable, performance-wise.
Interrogating ChatGPT
I had a fascinating discussion with ChatGPT. First I asked it to generate a base case query: only one aggregated column. It generated a direct table join-based query. By now we have confirmed that this approach works only in the base case, or if you’re lucky with the data distribution.
I asked what other ways it knows to generate such a query. It knew all three: direct joins, CTE-based, and correlated subqueries. It considers CTE-based queries “slightly more complex”.
Here is what it said about correlated subqueries: “Can be inefficient on large datasets (one subquery per user)”. This was an “aha!” moment. Why does it think so? We know that this has generally not been true for many years. Who told it that, basically?
Next I asked it to add one more column to that query, as we did above, and generate all the approaches that it knows. For the direct table join it specifically mentioned: “COUNT(DISTINCT) may be needed to avoid overcounting due to comment joins”. Nice, it explicitly knows about the overcounting problem.
For correlated subqueries, it repeated the “one subquery per user” warning, and even highlighted “inefficient”.
For the CTE-based approach, it only comments on the query itself: “PRO: Best for readability and reusability; PRO: Clean separation of logic; CON: Slightly more boilerplate”. It does not mention performance characteristics at all.
In the summary, it doubles down against correlated subqueries: “Very readable, but slow”. Here I finally asked why it hates correlated subqueries so much. It turns out that its knowledge is quite dated, and it thinks that only the most advanced query engines are able to convert correlated subqueries into a JOIN. Then it admits that “most advanced query engines” basically means “virtually all of them”. Then it admits that there is actually nothing wrong with correlated subqueries.
How much better am I than the LLM?
This is all the usual experience of talking to an LLM about something that you actually know a lot about. But this is where I realized that my own understanding of correlated subqueries is very similar. Before I started writing about this (a few months ago), I knew about correlated subqueries, but they would not have been my go-to approach. Just looking at the syntax, it’s clear that they should be roughly equivalent to a JOIN. That is, there should not be “one subquery per row”, in theory. But I wouldn’t have been sure whether that was actually the case, and how each database would handle it.
Also, it’s clear that “one subquery per row” here means something different: it’s not that we send N identical subqueries to the server, one per output row. Even if that is technically what happens, those subqueries are much lighter than normal standalone queries. And I don’t know what the situation with correlated subqueries was 20+ years ago, or across different query engines. Is this a baseline now? I’m pretty sure that some specific commercial database, such as Oracle, learned to do this earlier than everyone else. And so on, and so on. This needs further investigation.
One other thing that I wanted to think about is why any combination of LEFT JOINs + GROUP BY cannot be automatically optimized into the same query plan as the CTE-based approach. That’s a topic for another post, though.
The contribution of this work
Writing a text like this puts you in a difficult position. The requirements are often contradictory:
we don’t want to discuss specific optimizations for specific database servers;
at the same time, we want to discuss up-to-date capabilities of modern databases;
So we have two different risks:
giving general advice that doesn’t work in the reader’s specific situation because their database is not capable enough;
not mentioning specific advice that would be perfectly suited for a particular situation because that reader’s database is actually a positive outlier;
Also:
the general approach may not actually work for huge datasets (even though we think that it should);
for smaller datasets this approach may be overkill: a simpler query may well be faster;
The main contribution of this work is the entire framework for structuring analytical SQL queries of this type. We show that a query can be broken down in several ways:
column-by-column: adding aggregated columns predictably increases query complexity;
across the select/display barrier;
batch-by-batch: you could paginate over the unique key and generate results incrementally (see the sketch after this list);
across pipeline nodes: our approach to query design directly maps to a pipeline graph, which enables distributed processing of the logical query.
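Here is the batch-by-batch sketch mentioned above: a hypothetical keyset-paginated variant of the base-case CTE query. The batch size of 1000 and the :last_seen_user_id placeholder are assumptions; a real incremental job would probably also restrict the aggregate subqueries to the current batch of keys:
WITH user_ids AS ( -- unique key subquery, one batch at a time
  SELECT users.id AS user_id
  FROM users
  WHERE users.id > :last_seen_user_id
  ORDER BY users.id
  LIMIT 1000
),
posts_count AS ( -- aggregate subquery #1
  SELECT posts.user_id, COUNT(*) AS posts_count
  FROM posts
  GROUP BY user_id
)
SELECT user_ids.user_id, COALESCE(posts_count.posts_count, 0)
FROM user_ids
LEFT JOIN posts_count ON user_ids.user_id = posts_count.user_id
ORDER BY user_ids.user_id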
Our approach helps you validate the correctness of your queries because you can run each of the subqueries separately to confirm that they return the expected result. Subqueries are assembled in a way that preserves correctness.
Our approach helps you validate the correctness of your queries by requiring you to explicitly declare the unique key of your resulting dataset.
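For example, the declared unique key can be checked on its own, before any aggregate subqueries are attached. A sketch that should return zero rows (trivially true for users.id, which is a primary key, but useful for composite keys produced by joins):
WITH user_ids AS ( -- unique key subquery
  SELECT users.id AS user_id
  FROM users
)
SELECT user_id, COUNT(*) AS duplicates
FROM user_ids
GROUP BY user_id
HAVING COUNT(*) > 1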
Discussing the direct table join approach, we demonstrated that a good execution plan is not enough. We showed that some queries would have a perfect execution plan, but would run for an unacceptably long time.
In particular, having all the necessary indexes may not be enough. In other words, table design is not enough: you also need to design your queries for scalability.
We showed that some queries are sensitive to the specific data distribution in your database. We demonstrated that our query design is much more tolerant to data distribution variance.
A better understanding of COUNT(id), COUNT(DISTINCT id) and COUNT(*) is needed. This may be the key to understanding the general overcounting problem in the presence of JOINs. A separate issue may be the oversumming problem that we did not discuss here. However, our approach to query design seems to avoid most cases of overcounting, and probably oversumming.
Particularly, COUNT(id) vs COUNT(DISTINCT id) demonstrates a common problem: you can underspecify your requirements, but still get correct results because, say, uniqueness is accidentally guaranteed by some other means. Changing the query will break this guarantee, exposing the bug.
Generally speaking, scalability is not necessarily about speed; it’s rather about reducing variance.
LEFT JOIN should be taught in an entirely different way than it is taught currently. INNER JOIN too, to a lesser degree.
We demonstrate the idea of estimating query processing effort by looking at the conceptual underlying dataset. It is close to the way we think about query plan optimization using indexes, but is distinct from it. This idea requires further research.
P.S. Here is another link to the original text, for convenience: https://kb.databasedesignbook.com/posts/systematic-design-of-join-queries/. It explains the approach directly, without investigating alternatives and context.