Historically, 4NF explanations are needlessly confusing

Dec 05, 2024

I was always confused by the way the fourth normal form (4NF) is explained on the internet. I mean, if you want to design some actual database tables in 2024, do you need to keep 4NF in mind? Also, why is it so hard to understand texts about 4NF, no matter how many times you read the same material by different authors? I was nerd-sniped by those two questions for some years. Recently I had some historical insights and here is my current understanding of how this all came to be.

First we will discuss a simple example of business requirements, and how to design tables that let us store this information. This is a zero-surprise section. The design presented in this section is, to the best of my understanding, in 4NF.

Second, we’ll discuss a notion from 4NF that is unnecessarily complicated: multivalued attributes, and why it’s just the list of IDs.

Then we’ll discuss this pattern of presenting 4NF in a backwards way: a) presenting a weird table design b) decomposing it into independent relations c) declaring that we have now reached 4NF. Here I claim that this is just the wrong and confusing way to teach 4NF (if there is anything to teach even).

Then we’ll discuss the historical perspective of how 4NF was introduced and how it was reflected in follow-up papers. We’ll use primary source materials written in 1977, 1982 and 1992. We’ll try to guess why they needed to present this topic in this particular sequence of exposition. We’ll also look at modern texts, starting from Wikipedia, some random articles on the internet, and analyze what ChatGPT has to say about 4NF.

Baseline: M:N link table design

Let’s forget about 4NF for a moment, and just design some tables.

Suppose that we’re building a system that lets people find sports instructors. Each instructor has a number of skills they can teach: yoga, weightlifting, swimming, etc. Also, instructors can speak one or more languages: English, French, Italian, etc.

How do we design tables in a typical relational database (such as MySQL or Oracle) that will store this information? First, this is not a trick question. Second, the idea is not to get fancy. We’ll implement the design that would be supported by basically every classic relational database, starting from the early eighties.

We have three entity tables:

instructors: (instructor_id, full_name);
skills: (skill_id, skill_name);
languages: (lang_id, language_name).

When we hire a new instructor, we insert a record into the “instructors” table. When we decide to teach a new skill, we add a row into the “skills” tables. When there is a market demand in a new language, we add a row into the “languages” table. There is nothing surprising here.

Now how do we store instructor’s skills and instructor’s languages? A very common and non-controversial implementation is to have a two-column table for skills, and another similar table for languages.

For readability, let’s use the following IDs:

for instructors, ID just a number without a specific meaning (1, 2, 3, …);
for skills, we’ll use a a machine-readable short string in lowercase (“yoga”, “weightlifting”, “swimming”, …);
for languages, we’ll use ISO-639-1 two-character strings (“en”, “it”, “fr”, …);

Here is a table that stores skills information, with some example data:

It has two columns (and we don’t need more). Here is how we can read it:

an instructor with ID=2 teaches yoga and Pilates;
an instructor with ID=3 teaches weightlifting, CrossFit and boxing;
an instructor with ID=5 has no rows in this table, probably because they just joined our system and did not enter their details yet;

The table that stores language information is nearly identical:

It also has two columns (and we don’t need more). Here is how we can read it:

an instructor with ID=2 speaks only English;
an instructor with ID=3 speaks English, Italian and French;
an instructor with ID=5 speaks English (and they managed to enter this information about themselves);

There are alternative ways to organize that data, but we’ll talk about that later.

This design is completely non-surprising. Absolutely nobody would blink if they see similar tables in a real-world relational database. It is completely expected. If you present this design on a job interview, it will virtually always be accepted as an adequate answer. You will see such design in thousands of books, university courses, tutorials, real-world databases etc., etc.

How can we summarize this in a practical way, without mentioning any normal forms? Imagine that your less-experienced colleague approaches you and asks for advice on implementing those requirements, what would you say? You don’t need to provide an academically bullet-proof lecture on relational theory. My response would probably be:

“If you want to link two entities (e.g.: Instructor and Skill), and this link is many-to-many, you can create a new table with two columns: ID of first entity and ID of second entity.
You also need to define a primary key because every table needs one. Here a primary key would consist of both columns (e.g.: PRIMARY KEY (instructor_id, skill_id)). Note that the primary key here would also prevent data duplication.
Also, for every similar link you can just go ahead and create a new table, it’s fine: creating new tables is free. You also don’t need to worry about performance now (particularly, joins); even if this problem arises, we’ll solve it.”

There are many points to discuss here in depth, there are many statements that allow alternatives. But I still think that following this advice will take you a very long way.

In the upcoming “Database Design Book” correct table design is introduced by first explicitly defining links in a structured way. Database tables are defined based on this, using a clean standard two-column approach.

The surprising thing about this table design is that it’s apparently in 4NF. So if you follow this advice, you apparently reach the same design as you would have reached if you followed the 4NF definition as commonly seen in the literature?

Reference materials

In the rest of this article we’re going to refer to the following list of publications:

Ronald Fagin “Multivalued Dependencies and a New Normal Form for Relational Databases”, 1977 (https://dl.acm.org/doi/pdf/10.1145/320557.320571); [Fagin 1977]
William Kent “A Simple Guide To Five Normal Forms In Relational Database Theory”, 1983 (https://dl.acm.org/doi/pdf/10.1145/358024.358054); [Kent 1983]
Margaret S. Wu “The Practical Need for Fourth Normal Form”, 1992 (https://dl.acm.org/doi/pdf/10.1145/134510.134515); [Wu 1992]
Wikipedia contributors “Fourth Normal Form”, 1 December 2024 (https://en.wikipedia.org/w/index.php?title=Fourth_normal_form&oldid=1232923248); [Wiki 2024]
ChatGPT 4o, a response to a prompt “how relevant is database Fourth Normal Form in practice?”, 1 December 2024 (https://chatgpt.com/share/674cd27f-f3cc-8012-9d18-d8fa1e5c765b); [ChatGPT 2024]

Multivalued attributes are just lists of unique IDs

If you read any text that defines 4NF, the first new term you hear is “multivalued dependency”. [Kent 1983] also uses “multivalued facts”. I may be dumb but I only very recently realized that it means just “a list of unique values”. Here it would be even better to say that it’s a list of unique IDs.

Taking this into account, we can look at the instructors/skills/languages problem from a different (but equivalent) point of view.

Let’s forget about the table structure for now, and ask the question: “what does an instructor with ID=2 teach?” A: “Instructor with ID=2 teaches yoga and Pilates”.

Similarly, we can ask a different question: “which languages does an instructor with ID=3 speak?” A: “ID=3 speaks English, Italian and French”.

It’s 2024, a lot has changed since 1977 and we can actually treat a list of values as a value. Here is an alternative design of a table that stores skills information, with the same data as in the example above:

This table has a row per each instructor. The list of skills also has some additional constraints: order does not matter, and you are not allowed to have the same skill twice.

Again, here is not a place to make this theoretically bullet-proof. The point is that “multivalued dependency” can be thought of as a list of unique IDs; you could even say that it’s a set of IDs (in the mathematical sense of this word).

In real-world databases, you can even implement this design directly. Postgresql supports array-typed columns. Many modern databases support JSON-typed columns, and so you can use JSON arrays. I’m not saying that you should, but you can.

There are many interesting things to say about this design, both from the theoretical point of view and from technical points of view. We won’t do that here.

I don’t know about you, but for me this insight was a crucial part of understanding that there is, in a sense, not a lot of “there” in 4NF.

On single atomic values

For me this was another point of confusion. As [Wu 1993] states: “The only attribute values permitted by 1NF are single atomic values”; and this is a very common understanding. It seems, though, that it is in conflict with the idea of “a multi-valued fact about something”. As far as I understand, our ancestors went to some lengths here so as not to suggest that you can think about this as an array or something. They seemed to have enough trouble with people misunderstanding the relatively new relational model in different ways.

But in 2024, it’s very natural and convenient to represent “multi-valued fact” as a list, and maybe even treat it as an attribute of sorts?

The reasoning would go roughly like this. What could “single atomic value” mean? Does it mean only a scalar value, like 5 or “Lisbon”? Would a tuple of (10, “EUR”) still be considered “a single atomic value”? (I’d say yes.) Well, the next obvious step may well be to treat [“en”, “it”, “fr”] as a single atomic value. After all, that’s exactly what Postgresql did. Basically, “something that can be used as a column value”.

Again, if you squint hard enough, “single atomic value containing an array” and “multivalued fact” seem to be just two sides of the same coin.

Presenting 4NF via non-4NF

Many, if not all of the texts that describe 4NF present it in the following way:

show a weird table design with three columns;
“decompose” this weird table design into two tables with two columns each;
declare that the latter form is in 4NF (as it should be).

The main point of this text is that this round-about way of presenting 4NF is completely unfounded and confusing.

[Kent 1983] is valuable for us because it was written specifically for brevity and clarity. Let’s look at the example presented there (section 4.1):

The example here strongly resembles our instructors/skills/languages use case, discussed above. The text says: “Under 4NF, these two relationships should NOT be represented in a single record. Instead, they should be represented in the two records”. I mean, yeah, I completely agree, but the question is:

Why did you even introduce this weird single-record (employee, skill, language) design in the first place?

Kent spends the entire second column of this page, and a half of the following page discussing this weird “single-record” three-column design. It’s fascinating reading, after you understand what’s going on. Unfortunately, it’s very confusing if you don’t. Kent spends a lot of time trying to find some sense in this representation. He introduces five(!) variations, every one super awkward. Let’s look at three of them and make sure that we understand what sort of information is represented here.

Each of those three variations say only that:

Smith can cook and type (has 2 skills);
Smith can speak French, German and Greek (speaks 3 languages);

Note that “cook” and “French” are not related here, they are just on the same line, and that means nothing. Note also that “type” is mentioned twice in variations (a) and (c), but that means nothing, it’s just one skill.

Nobody ever would actually implement this design in 2024 (and in many years before that), because it is just awkward and unusable in practice. It makes no sense.

Anyone would have implemented it as two tables:

and

People learning the basics of relational table design must learn this design first. Instead, they are being presented with the wrong version upfront. Why?

You don’t need to learn to “decompose” anything because you shouldn’t have started with the “composed” version to begin with! I mean, I understand that it’s possible that a very confused novice somehow conjures the composed design, and you may have to correct this error, but you can do that in “common mistakes” section.

Why do you even need an entire “theory” behind 4NF specifically? The existence of the theory implies that the “composed” design somehow arises, naively(?) or naturally. Because of that you supposedly have to algebraically prove the necessity to use a different, normal design. Well, couldn’t you just say “don’t do this, just use two-column tables instead”? Maybe using “ties” from Anchor Modeling would be a better approach.

I’m completely puzzled by why this pattern of presenting 4NF gets repeated over the years, again and again.

[Fagin 1977] does that (see “Employee / Salary / Child” table).
[Kent 1983] does that, as we just discussed;
[Wu 1992] does that, with the “(ACCOUNT ID, HOME PENDING CODE, FIELD PENDING CODE)” relation;
[Wiki 2024] does that, with the “Restaurant / Pizza variety / Delivery area” example;
ChatGPT does that, with a “Professor / Subject / Committee” example;
many university courses, YouTube tutorials and books do that;

Once you see it, you can’t unsee it.

Why are they doing that? (If you can shed some light on this particular question, please leave a comment.)

Why did they have to do that?

Sometimes rhetorical questions may have actual answers. [Wu 1992] is a precious source of information for every historian of relational model, such as myself.

There we can find the first data point to bring us closer to an answer. Out of 40 databases that were investigated as part of that study, 9 (nine) contained 4NF violations. I assume that non-4NF representations were the same as shown above: weird “composed” three-column designs. We don’t know which of the five variations according to [Kent 1983] were used, and how often. But we can be sure that this was a very real problem.

We can get the second data point by looking at the publication dates. The Paper (E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”) came out in 1970. [Fagin 1977] was written only seven years later (and we know that the speed of time was much, much lower than today).

Only a very small number of people today may even remember what were the alternatives to relational model at that time. The more I read about early years of relational model, the more I see that for us it may just be hard to understand what people did not understand then. Things that we take for granted today were just unfamiliar and probably confusing then.

This is my main theory of why in the old days educators had to contrast 4NF against this weird “composed” table design. Apparently the overall ecosystem somehow heavily nudged people towards this design, and the only way to convince them otherwise was a mathematical theory.

This does not explain why this pattern of presenting bogus composed design and triumphantly decomposing it persists. Also, I am not sure why the idea of “multivalued dependency” continues to be the foundational theory for 4NF. It just goes against the grain of everything that we learn to know as part of database practice: both design and even simply using the database to store data.

Summary and conclusion

1/ Informally, the fourth normal form is just a two-column table that contains pairs of IDs. E.g., if you need to store “developers assigned to projects” with the many-to-many relationship, you just create a table with two columns: (developer_id, project_id). This is also called junction table, or link table. This is a very common design, you can see it everywhere. It is in 4NF.

2/ Traditionally, starting from 1977 when 4NF was introduced, it is defined in a weird roundabout way. First a “composed” three-column design is presented; then this design is “decomposed” into two two-column tables; those two tables are in 4NF. It’s never clear where the initial design came from. This design could only be seen in the texts that explain 4NF.

3/ “Multivalued dependency” is basically a list of IDs. For example, if we need to record the list of languages that employees can speak, here is the dependency: employee_id=25 ← [“en”, “it”, “fr”]. You don’t need to learn this term to successfully design database tables. You only need it to understand traditional explanations of 4NF.

4/ You, as an educator, probably don’t need to present 4NF in this round-about way, you can present it directly as independent two-column tables. I invite you to omit the traditional way of presenting 4NF and just begin with the actual industry practice. Even experienced practitioners such as myself have trouble understanding pragmatics of the traditional way and extracting value from it.

If you have any further insights on why 4NF was presented this way, please leave a comment, I’ll be happy to learn.

ALEKSANDR

Dec 7

Alexey, I absolutely agree with you. These traditional formulations are difficult to understand. The ability to explain complex in simple is very valuable.

Expand full comment

Barry Johnson

Dec 15

# Presenting 4NF via non-4NF

Yes, we'd expect that an explanation of (n)th Normal Form (NF) would start with an instance in (n-1)th NF and explain why we need to make the additional transform. And yes, the examples from which we are supposed to gain this learning generally fail at this. Kent's '{Employee, Skill, Language}' is a particularly egregious example. His 4NF discussion has the definition:

> Consider employees, skills, and languages, where an employee may have several skills and several languages.

With that definition, we have '{Employee, {Skill}, {Language}}'. And it's a 1NF problem, with 2NF and 3NF solution '{Employee, Skill}' + '{Employee, Language}'. Trying to start with it as the basis for a 4NF discussion - which requires it to already be 3NF, which it isn't - means all the subsequent discussion of variants relates to issues that wouldn't be there if it was properly 3NF in the first place.

So to the degree this contributed to your "I may be dumb" comment, please feel assured that it's not you.

As I read Kent's first paragraph of his "4.1 Fourth Normal Form" section, I made the marginal note to "remove MV" from his description, making it:

> ... a record type should not contain two or more independent ... facts about an entity ...

I think this has more clarity. After all, it would be odd/unlikely for "independent facts" to not *also* be multivalued, and including that word is a bit of a distracting tautology. And, I reckon, this simpler version ties back exactly to the understanding of 4NF that I earlier said I used.

(His 5NF example is also problematic in the same way as the examples I used in my discussion of textbook treatments of 5NF.)

That's it for my comments. At least for now? Hope you find them useful.

5 replies by Alexey Makhotkin and others

18 more comments...

Minimal Modeling

Discussion about this post