Systematic design of multi-join GROUP BY queries
I wrote a long-ish (5000+ words) tutorial on systematic design of complex analytical SQL queries. It is aimed at people who got stuck trying to write a long complicated SQL query with:
many JOINs,
lots of subqueries,
the use of GROUP BY and aggregation functions,
the DISTINCT keyword.
One typical problem of such queries is data duplication (more rows than expected), or value multiplication (a numeric value is two or three times larger than it should be). Typically such queries are at least a couple of dozen lines long, and often hard to read and understand.
People ask for help with such queries, but it’s hard to actually help them without direct access to the database. This text defines a framework that helps design queries that are correct by construction. Also, it allows you to validate the correctness of different pieces independently.
Here is the link: https://kb.databasedesignbook.com/posts/systematic-design-of-join-queries/.
I’d greatly appreciate any feedback: did it help you with structuring a problem that you have right now? Could it have helped with a query in the past if you’ve used this approach? Do you know a better way to do that? Do you know other texts that discuss this topic and provide a useful framework? Do you see any problems with this approach?
I’m not even sure about the title. It sounds fancy but I feel that maybe it doesn’t quite cover all the types of problems the post addresses.
Unfortunately, it’s hard nowadays to search for “complex SQL query design” and hope to find anything useful. In a median article on the web “complex” often just means “slightly harder than a JOIN”, or “uses a window function”. This is disappointing.
I hesitate to share some motivating examples of Reddit posts that inspired me to write this, but it seems that every week or more often you see a post that could be directly helped by the approach explained above.