Few days ago I found myself in a peculiar situation. I’ve finished the last post with “Next week we’ll probably begin tackling the database normalization forms”. I was going to discuss the first normal form, or more precisely: how 1NF is taught and understood by people (including myself, because I’m not sure if I had a clear definition in mind). However, as I was procrastinating, I’ve opened Lobsters and a post titled “What does First Normal Form actually mean?” caught my attention.
This was a serious David Lynch moment: this post is exactly what I was going to write, up to some expressions. Yet, apparently, someone else wrote it. Maybe this particular topic just wants to be written about so much that it just invaded minds of several different people, you know, to overcome our collective procrastination.
I’ve just read this post (and highlighted a lot). It’s brilliant. It’s exactly what I would hope to have in this substack. Read it: “What does First Normal Form actually mean?” by Galimatias. (I must also say that had I not been procrastinating but instead worked on my version of this post, I would have found myself in an awkward situation on Tuesday.)
I considered just posting this link, but I can now build on this post and try to investigate this topic a bit further. First, let me summarize the key ideas from Galimatias’ post, as I understand them. The single most important statement is:
1NF is not concerned about values, only about types.
Also:
Codd et al. introduced normal forms a long time ago, and some of the historical context is a bit lost for us. They tried to distance themselves from some things that are no longer relevant (such as hierarchical databases). Nowadays the relational model is so pervasive that it’s hard to find a good example of a table design that is NOT in 1NF. That’s why many teaching examples confuse people: they are just not good examples.
1NF suggests that the types of columns should be “simple”. The word Codd used is “atomic”, but nowadays it has other meanings. I interpret this word as “scalar”. Galimatias adapts the original statement for the modern reader in the following way: ‘In more common database terminology, this means "Eliminate columns which can have tables as values".’
A big motivation behind 1NF is simplicity: you can store all kinds of data as two-dimensional arrays and each piece of data is identified by table name + primary key value + attribute name.
When I was doing my own research on this, I found a very short text by William Kent circa 1982: “A Simple Guide to Five Normal Forms in Relational Database Theory”. In the “Acknowledgements” section we see: “I am very grateful […] especially to Chris Date for helping clarify some key points”, so we can trust its authority on then-understanding of this topic. Here is how it describes 1NF (emphasis mine):
First normal form deals with the "shape" of a record type. Under first normal form, all occurrences of a record type must contain the same number of fields. First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.
So, yeah. Nowadays it’s hard to create a table with a variable number of fields (unless you use Excel). But let’s discuss the part that we called “scalar” (a.k.a. “atomic”).
Limits of firstness
Many people start learning database modeling with an idealized relational database that looks like a safe subset of commonly used database engines: there are tables, rows, columns, and values of different scalar types: number, string, boolean, date and time. Basically something that could be taught with almost no changes since the 1990s or even earlier. Such databases are in 1NF by construction.
But how could non-1NF database design look in practice? Is it even possible, using commonly available tools? (except for Excel, of course)
In the “Looking Back at Postgres” article by Joseph M. Hellerstein (2019) we find the following “Figure 1: Postgres features first mentioned in the 1986 paper and the 1991 paper” (emphasis mine):
1. Supporting ADTs in a Database System
a. Complex Objects (i.e., nested or non-first-normal form data)
b. User-Defined Abstract Data Types and Functions
c. Extensible Access Methods for New Data Types
Postgresql is known for its support for richer data types; we’re particularly interested in geometric types, JSON type, and array type. Let’s see how this “atomic” thing incorporates those types.
Array type and 1NF
Consider a very simple, straightforward and obviously 3NF table users with the following fields: id, real_name, loginname, year_of_birth. Suppose that we want to store a list of spoken languages that the user knows. The canonical relational way of storing that information is creating a new table called user_languages with the following structure:
CREATE TABLE user_languages (
user_id INTEGER NOT NULL,
lang CHAR(2) NOT NULL,
UNIQUE INDEX (user_id, lang)
);
But in Postgres, we have a different option: we can add a new column to the users table. This column would have the type of array of strings (note the square brackets):
CREATE TABLE users (
id INTEGER NOT NULL PRIMARY_KEY,
-- ... other fields ...
langs CHAR(2)[]
);
Both representations are almost equivalent. We can convert classical representation using ARRAY_AGG() function, something like:
SELECT user_id, ARRAY_AGG(lang) AS langs
FROM user_languages
GROUP BY user_id;
We can convert array-based representation using LATERAL JOIN (my hands-on knowledge of Postgresql is not up-to-date, so I may use incorrect syntax, but hopefully the general idea stays valid):
SELECT user.id, lang
FROM users INNER JOIN LATERAL users.langs AS lang
Many other typical queries that we may want to execute against both representations could be similarly worked out.
So, the question is: after we’ve added an array-types langs field to our users table, is it still in 1NF?
We may say that what we have here is a rough equivalent of the proverbial comma-separated text value that is one of the arguably misguided examples used on the internet for teaching 1NF. You may say that you could have had a varchar-typed column, and the table would certainly still be 1NF, because the database does not care about the structure of our strings:
CREATE TABLE users (
id INTEGER NOT NULL PRIMARY_KEY,
-- ... other fields ...
langs_csv VARCHAR(64)
);
I’m not sure what my personal opinion is at the moment. It feels weird to declare the users table as non-1NF because of an array-typed column containing a bit of data. Do we even want to pass that judgement, would it be useful in any way? What would change?
As a bit of a Mt 4:9 moment we could push a bit further and consider a teams table. Each team has a number of people in it. Where do we keep this association? Would that be the classical two-column team_members table, or the array-typed team.members column?
The latter design would obviously go against Codd: "Thou shalt eliminate domains which have relations as elements" (quote may not be exactly correct). But is it?
As we said, both representations are equivalent in a sense that we can convert one to another and vice versa.
Why bother?
It’s really unclear if we must agree on the firstness of our tables even when using modern non-scalar data types.
But there is one important teaching aspect: many texts teach you that normal forms are somehow successive: that is, lower forms are prerequisites for the higher forms. In addition to other sorts of confusion outlined in the “What does First Normal Form actually mean?”, this may introduce unnecessary anxiety.
This is also, I think, one of the reasons why we should not teach 1NF, at least in the beginning.
There are several other interesting aspects of 1NF, which we’ll discuss in the following posts (unless somebody else writes them for me).
"Bumped into" your writings recently, and was particularly intrigued by your reference in "Making Sense of 1NF, pt. II" to the archival piece by William Kent ("A Simple Guide to Five Normal Forms ..."). Alas, the discussion of 5NF again disappointed me. As it happened, I was finalising a new addition to my writings about this, which I've now posted ("5NF: The Missing use Case") at DataDemythed.com. In case you'd care to read it for a different perspective.