Virtual 6NF
One of the goals of this substack is to research ways of removing historical cruft from the way the relational model is taught. One thing that particularly puzzles me is why Sixth Normal Form (6NF) is historically treated so... reverently? Lots of texts on the internet consider 6NF to be “exotic”, “academic”, “never used in practice”, etc., etc.
As software developers, we can solve any problem by adding one level of abstraction, and database modeling is no exception. I’m going to show that any database can be represented as a set of relations in either 6NF or 4NF, and why that’s useful.
NB: Any database is subject to 6NF decomposition
We won’t change the original database schema. Our 6NF relations will be purely virtual, like table views or just subqueries.
Using this approach, we can dissect any database, no matter what physical table design strategy it uses. We can handle everything:
classic textbook 3NF approach;
Entity-Attribute-Value (EAV);
array-typed columns, like in Postgres;
JSON-typed columns of most forms;
etc., etc., including of course the usual uhm… hybrid mix that can be seen everywhere;
NoSQL is fine too. You don’t even need a classic relational database: MongoDB or DynamoDB are also representable. Even more exotic things could be handled, including possibly graph databases, but that moves us somewhat too far into the virtual realm.
Also, virtual 6NF decomposition is not restricted to a single physical database. Your company’s data could be stored in many different databases, using different database engines, and it can still be decomposed as per this approach.
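For example, even a JSON-typed column decomposes into the same kind of narrow relations. Suppose a Postgres users table keeps a jsonb column called settings (the table and column names here are made up for illustration). The user’s timezone, buried inside that JSON, can be carved out as a clean two-column relation:

```sql
-- Hypothetical schema: users(id, settings jsonb).
-- Extract the user's timezone as a two-column relation,
-- skipping users that have no timezone set.
SELECT id, settings->>'timezone' AS timezone
FROM users
WHERE settings->>'timezone' IS NOT NULL
```

The physical storage is a nested document, but the virtual relation is just as plain as one extracted from an ordinary column.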
The main hypothesis of Minimal Modeling
Minimal Modeling is built around the main hypothesis: namely, that any database, any at all, can be represented as a combination of:
anchors;
attributes;
links;
secondary data of all kinds.
Anchors, attributes and links are together called primary data. Each of them can be represented by a relvar (“relation variable”, basically a dataset or a table):
an anchor is a trivial 6NF relvar (a single column: ID only);
an attribute is a simple 6NF relvar (two columns: ID + value);
a link is a 4NF relvar (two columns: ID1 + ID2).
“Relvar” is a term from relational theory. Each such relvar is virtual (as befits a theoretical concept): it may not exist as an actual physical table. For clarity, we’re going to represent relvars as simple SQL queries.
Secondary data is all sorts of duplicated data: pre-aggregated datasets, pre-computed caching columns, copies of the same data in different tables, materialized queries, subsets of data, and so on. The most important property of secondary data is that it can be fully regenerated from primary data. If you cannot regenerate something from primary data, then it is itself primary data.
A thought experiment in decomposition
The goal of this post is to conduct a thought experiment. We won’t actually be doing anything: no database refactoring, no need to write down the full list of every single element of your database, etc.
At least, initially. Our goal is to convince ourselves that all the data in our database, whatever it is, could be represented as a list of virtual 6NF relvars (tiny subqueries, basically).
Anchors
Anchors roughly correspond to entities: users, posts, orders, invoice lines, etc., etc. Anchors do not contain any data; they only manage IDs. So, we need to write an SQL query that returns every ID of a certain anchor.
For example, for the User anchor we could write the following query:
SELECT id
FROM users
That’s it, a single-column query. This query needs to return the same number of rows as there are users registered in your database. If we have 200 users, this query needs to return 200 unique IDs.
As for the data type of an ID, it could be any usual type: integers, UUIDs (interpreted as very big integers or as strings), strings.
In classic relational table designs, you could also have a tuple data type (to represent composite primary keys, more on that in a follow-up post).
Attributes
Attributes carry the actual data about anchors. Some common examples are:
name of the User;
Payment amount;
time when the Order was placed;
text of the Comment;
etc., etc.
Here, “User”, “Payment”, “Order”, and “Comment” are anchors.
We deal with each attribute separately. We need to implement an SQL query that returns two columns: ID and the attribute value. Example:
SELECT id, text
FROM comments
Here are the requirements for the resulting dataset:
the ID is unique in each row;
the IDs are the same IDs that were listed in the corresponding anchor dataset (here, “SELECT id FROM comments”);
there are no NULL values in the second column;
moreover, there should be no other sentinel values in the second column, such as empty strings, nonsensical numbers (0, -1, etc.), strings like “N/A” and “UNKNOWN” that are just NULLs in disguise, etc.;
attribute values cannot contain anchor IDs (see the “Links” section below).
The number of rows in the attribute dataset may be smaller than the number of known IDs for that anchor. That is, we only want the “real” values of the attribute, without any placeholder junk (sentinel values and NULLs). Because of that, some attribute subqueries include conditions, e.g.:
SELECT id, cancelled_at
FROM orders
WHERE cancelled_at IS NOT NULL
What we do here is basically clean up the data (virtually).
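The same cleanup applies to sentinel values. Here is a sketch, assuming a hypothetical users table that stores missing emails as empty strings or the string “N/A”:

```sql
-- "User email" attribute: keep only real values, skipping NULLs
-- and the sentinel values used in this (hypothetical) table.
SELECT id, email
FROM users
WHERE email IS NOT NULL
  AND email <> ''
  AND email <> 'N/A'
```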
Note that in general attributes and tables could be completely decoupled from each other. That is:
attribute values could be stored in a different table from the one that was used in the anchor query;
one physical table can contain multiple attributes (a common example is the Entity-Attribute-Value modeling approach). You could have a three-column physical table that contains hundreds of different attributes.
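As a sketch of the EAV case: suppose attributes of users are stored in a hypothetical three-column table entity_attributes (entity_id, attr_name, attr_value). The “User email” attribute subquery then simply filters and renames:

```sql
-- "User email" attribute extracted from a generic EAV table.
-- One such subquery exists per attribute stored in this table.
SELECT entity_id AS id, attr_value AS email
FROM entity_attributes
WHERE attr_name = 'email'
```

Hundreds of attributes stored in this one physical table would each get their own tiny subquery of the same shape.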
Links
Links connect two anchors. Links are described by sentences that mention both anchors. Examples:
User likes Post;
Order includes Order Lines;
Employee is a manager of Employee (an example of an anchor linking to itself).
Links roughly correspond to relationships. Links have cardinality (one-to-many, many-to-many, one-to-one). We treat all cardinalities the same way here.
Each link has a corresponding virtual query that returns two IDs. Here is an example:
SELECT order_id, id AS order_line_id
FROM order_lines
Here are the requirements for the resulting dataset:
each pair of IDs is unique;
neither of the two IDs can be NULL (or any other sentinel value, such as a fake zero ID);
the IDs are the same IDs that were listed in the corresponding anchor datasets.
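For a many-to-many link such as “User likes Post”, the physical representation would typically be a junction table, and the link subquery is just its two ID columns (the table name likes is an assumption):

```sql
-- "User likes Post" link: one row per (user, post) pair.
SELECT user_id, post_id
FROM likes
```

If the physical table allows NULLs or fake zero IDs, the subquery would filter them out, just like the attribute subqueries above.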
Sometimes links can be confused with attributes, and it’s important to clearly separate the two. The confusion arises because, at the physical table level, a link and an attribute can look similar. Here is an example table definition:
CREATE TABLE employees (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(128) NOT NULL,
manager_id INTEGER NULL
);
Here we have the following elements and their subqueries:
the “Employee” anchor: “SELECT id FROM employees”;
the “Employee name” attribute: “SELECT id, name FROM employees”;
the “Employee is a manager of another Employee” link: “SELECT manager_id, id FROM employees WHERE manager_id IS NOT NULL”.
What we see here is that the columns “name” and “manager_id” are defined in a similar way, but one is an attribute and the other is part of a link.
Note also that the order of columns matters! We need to clearly understand which ID is the manager and which ID is the team member.
What’s next?
As a result of our thought experiment we now have a long list of SELECT queries. We can see that theoretically we could have written down the entire list of such subqueries, no matter how big and complicated our database was. We could hire some people and ask them to do this work for us, give them access to our resident database experts, let them ask questions, and after some time they would actually bring the coverage up to 100%.
Could we ask an LLM to do that? That’s a good question, and our current understanding is that an LLM would probably be able to do 80% of the work (the Pareto number), but the remaining 20% would require human involvement. (More on that here: https://minimalmodeling.substack.com/p/minimal-modeling-and-genai). Anyway, suppose that we could afford 100% of human-quality work here.
What can we do next, now that we think of our database as many virtual subqueries?
1/ We can see that there is nothing particularly exotic about 6NF. It’s just a decomposition of a bunch of data into minimal pieces. We could call them atomic, but this word already means something different in databases (atomicity of transactions) so we avoid it.
2/ We can note that we don’t need 100% of such queries, maybe we could have just a handful of subqueries that are most interesting for us at the moment.
3/ We can see that each SQL subquery is basically documentation for that piece of data. It shows the table where the data is stored, and how it is stored (is it a normal column, an EAV attribute, a JSON field, or some other physical representation?). It also shows how to clean up this data: what are the sentinel values that need to be skipped, etc. We can use such SQL subqueries for knowledge transfer.
4/ We can make some of those subqueries more material. We could create a database view, if our database supports those. We could create a materialized view — an actual physical table. It would be narrow, efficient and clean. We could create a pipeline for this data element to keep this materialization up to date. We could copy the SQL fragment as a CTE into our larger query. We could have a library of such CTEs.
5/ We could look at the SQL fragment and decide to actually migrate the database so that it has a simpler, more efficient, and clearer representation. We could do that element by element.
6/ We could begin assembling a data catalog built on anchors, attributes, and links, as opposed to traditional table-oriented data catalogs.
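The materialization options from point 4/ can be sketched with the cancelled-orders subquery from the attributes section (a sketch; view support and the view name are assumptions):

```sql
-- A plain database view that gives the clean dataset a name:
CREATE VIEW order_cancelled_at AS
SELECT id, cancelled_at
FROM orders
WHERE cancelled_at IS NOT NULL;

-- The same fragment carried into a larger query as a CTE:
WITH order_cancelled_at AS (
  SELECT id, cancelled_at
  FROM orders
  WHERE cancelled_at IS NOT NULL
)
SELECT count(*) FROM order_cancelled_at;
```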
We’ll discuss building the data catalog from the existing database in a follow-up post.
There are some more niche topics that we did not cover here, to be posted some other time:
decomposing secondary data;
temporal 6NF representation;
3NF as another possible virtual representation;
single-column representation for booleans;
physical 6NF database design strategy;
thought experiment: re-assembling the original database;
etc., etc.