15 Comments

Alexey, I absolutely agree with you. These traditional formulations are difficult to understand. The ability to explain complex things in simple terms is very valuable.


# Presenting 4NF via non-4NF

Yes, we'd expect that an explanation of (n)th Normal Form (NF) would start with an instance in (n-1)th NF and explain why we need to make the additional transform. And yes, the examples from which we are supposed to gain this learning generally fail at this. Kent's '{Employee, Skill, Language}' is a particularly egregious example. His 4NF discussion has the definition:

> Consider employees, skills, and languages, where an employee may have several skills and several languages.

With that definition, we have '{Employee, {Skill}, {Language}}'. And it's a 1NF problem, with 2NF and 3NF solution '{Employee, Skill}' + '{Employee, Language}'. Trying to start with it as the basis for a 4NF discussion - which requires it to already be 3NF, which it isn't - means all the subsequent discussion of variants relates to issues that wouldn't be there if it was properly 3NF in the first place.

So to the degree this contributed to your "I may be dumb" comment, please feel assured that it's not you.

As I read Kent's first paragraph of his "4.1 Fourth Normal Form" section, I made the marginal note to "remove MV" from his description, making it:

> ... a record type should not contain two or more independent ... facts about an entity ...

I think this has more clarity. After all, it would be odd/unlikely for "independent facts" to not *also* be multivalued, and including that word is a bit of a distracting tautology. And, I reckon, this simpler version ties back exactly to the understanding of 4NF that I earlier said I used.

(His 5NF example is also problematic in the same way as the examples I used in my discussion of textbook treatments of 5NF.)

That's it for my comments. At least for now? Hope you find them useful.


> I think this has more clarity. After all, it would be odd/unlikely for "independent facts" to not *also* be multivalued, and including that word is a bit of a distracting tautology.

This is a very interesting idea and I need to think about that, thank you!

> So the *same* function determines the `a` and `b` values for a given `k`.

And this one, from another comment too! Actually, I think it may be useful to try to give an explicitly functional treatment of the relational model. Or rather, to use ideas from this post: https://graphicallinearalgebra.net/2015/09/08/21-functions-and-relations-diagrammatically/


> With that definition, we have '{Employee, {Skill}, {Language}}'. And it's a 1NF problem, with 2NF and 3NF solution '{Employee, Skill}' + '{Employee, Language}'.

Uhm, I think I'm not getting it here. I'm not sure what your braces notation means — does it show the primary keys?

{Employee, Skill} — how does it handle the case of multiple skills? Is the primary key composite here (PRIMARY KEY (EmployeeID, SkillID))?

How do you show, in your notation, a case where Employee has a LegalName?


# "Multi-Values Attributes" Versus "Single Atomic Values"

Go back and re-read Codd's original paper. He wrote of the elements in a tuple as being *domain* values. This seems to have been completely overlooked for a long time. Wu's reference to "single atomic values" feels to me like an assumption that everything is a "simple" NUMBER or TEXT. And hence *not* what Date discussed.

To me it was obvious from Codd's original paper that a good relational implementation therefore needs a powerful domain-definition mechanism. If I can define a domain whose instances are array/list values, then I can define a single-value tuple element that happens to be a list of values. To the degree SQL has domain-definition capabilities, they are woefully inadequate and clumsy. And, presumably for those reasons, rarely used.
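As a rough illustration of that inadequacy (PostgreSQL syntax; the names here are invented): CREATE DOMAIN only lets you constrain an existing base type, which is a long way from defining a new kind of value with its own behaviour.

```sql
-- A SQL "domain" is just a constrained alias for an existing base type;
-- it does not create a genuinely new kind of value.
CREATE DOMAIN percentage AS numeric(5,2)
    CHECK (VALUE >= 0 AND VALUE <= 100);

CREATE TABLE exam_result (
    student_id integer    NOT NULL,
    score      percentage NOT NULL
);
```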

PostgreSQL includes more domain types, datatypes (https://www.postgresql.org/docs/current/datatype.html) in its parlance, than I ever had when using Oracle (up through 2012). Consider the PostgreSQL 'polygon' datatype (https://www.postgresql.org/docs/current/datatype-geometric.html#DATATYPE-POLYGON). This is a list of 'x,y' coordinate pairs with closure implied from the last to the first, and includes the area within.

If I did this in any version of Oracle I used, I'd have a 1NF transform to store each of the coordinate pairs separately and explicitly, including another value to sequence them. In PostgreSQL, I can simply declare an item of type 'polygon' because it supports single values within that domain.

But that's not the most important part. A domain must also come with operators to manipulate instances of it within the language. Which PostgreSQL includes (https://www.postgresql.org/docs/17/functions-geometry.html). In Oracle, I would have had to *also* write those. And they would have been callable functions, not using operators like '+', '-', ... like PostgreSQL has done.
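A small PostgreSQL sketch of the difference (table and column names are invented for illustration): the whole polygon is a single value, and the domain brings its own operators and functions.

```sql
-- A column whose single value is a whole polygon: no 1NF-style
-- decomposition into a separate coordinate table is needed.
CREATE TABLE land_parcel (
    parcel_id integer PRIMARY KEY,
    boundary  polygon NOT NULL
);

INSERT INTO land_parcel VALUES
    (1, '((0,0),(0,10),(10,10),(10,0))');

-- The domain comes with its own operators and functions:
SELECT parcel_id,
       area(boundary)            AS square_units,
       boundary @> point '(5,5)' AS contains_point
FROM land_parcel;
```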

But generally we are stuck with using "simple" domains like NUMBER or VARCHAR. Or a few complex domains that are incredibly useful - like DATE, TIME, and TIMESTAMP - and included for our convenience. And probably a (<ugh>) BLOB-type thing. And now JSON (which, unfortunately, is ripe for abuse perhaps in part because of the poor support for domains more generally). One could argue that FLOAT is also a complex domain.


> He wrote of the elements in a tuple as being *domain* values.

I believe that this formulation is just one side of the coin, and the other side is that thou shalt not have relations as elements of relations.

I do not know how you can actually do that nowadays. I understand that back in the 1970s Codd had something that his model was an alternative to: hierarchical databases.

Btw, as an exercise, I was thinking about how we could build a database that is actually non-compliant with 1NF. I have two answers; one is purely algebraic: https://minimalmodeling.substack.com/p/an-algebraic-curiosity. Recently I was thinking about another:

We could put a SQLite database into a BLOB in a different database. We could actually code this, for example by implementing it as a user-defined type in PostgreSQL. Then we would be able to write a query that queries the database inside that blob.
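Nothing like this exists today, as far as I know, but the query might look something like the following, where sqlite_query is the hypothetical user-defined function doing the work:

```sql
-- Hypothetical: sqlite_query() would open the SQLite file image stored in
-- the BLOB column and run the given query against it, returning rows.
SELECT o.id, t.name, t.qty
FROM outer_table AS o
CROSS JOIN LATERAL sqlite_query(
    o.embedded_db,
    'SELECT name, qty FROM items'
) AS t(name text, qty integer);
```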

I think that it's important to think about that because, as we discussed, common explanations of 1NF are also not completely satisfactory.

Maybe I'm making things up, but I believe I've read statements to the effect that NULL somehow contradicts the relational model (makes it non-1NF?). Again, for me NULL is just a trivial sum type, a la Maybe, and I don't see any problem with that. But apparently before Miranda, people had trouble imagining this.


> And they would have been callable functions, not using operators like '+', '-', ... like PostgreSQL has done.

For me this is not an issue at all, again because of Haskell. In Haskell, operators are just functions with special syntax: you can write either

`2 + 3`

or, function-ifying the operator by putting it in parentheses:

`(+) 2 3`

So, for me "plus" is a function.

> And now JSON (which, unfortunately, is ripe for abuse perhaps in part because of the poor support for domains more generally).

Well, lol, it's almost on par with classic databases. I mean, of course it does not support even simple dates, but for a long time dates were basically the only non-basic type that many databases supported. Virtually no programming language directly supports dates either (I mean, except for class-based implementations, of course).

For me the main problem with JSON is when you start putting objects into arrays. Everything else in JSON is basically syntactic sugar over the normal relational model.


Thank you for your time!

> Wu's reference to "single atomic values" feels to me like an assumption that everything is a "simple" NUMBER or TEXT.

Since I started looking into this, I have had the understanding that a) 1NF requires that "cells" do NOT contain another relation; b) people in the past feared that if you make your data structure too complicated, you are in danger of introducing something that looks like a relation. So, out of an abundance of caution, somehow only the most trivial types were allowed.

That's why, for example, people have this strong reaction to columns that contain a string of comma-separated values. For me this design is just clumsy, because you can trivially redesign it in a much more useful way (id, value), but I don't even think that a comma-separated list of values is a relation. Both designs are isomorphic, so for me it's not a big deal. But I am not invested in this particular purity question; I would roll my eyes if I had to deal with CSV strings in SQL, and that's it. I don't think CSV is 1NF-violating.
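In PostgreSQL terms the two isomorphic designs might look like this (table and column names are invented); the CSV variant can always be unfolded with string_to_array and unnest, it is just clumsier to query:

```sql
-- Clumsy, but (to my mind) not a 1NF violation:
CREATE TABLE user_tags_csv (
    user_id integer PRIMARY KEY,
    tags    text              -- e.g. 'red,green,blue'
);

-- The trivially better design:
CREATE TABLE user_tag (
    user_id integer NOT NULL,
    tag     text    NOT NULL,
    PRIMARY KEY (user_id, tag)
);

-- Mapping the first design onto the second:
SELECT user_id, unnest(string_to_array(tags, ',')) AS tag
FROM user_tags_csv;
```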

Unrestricted JSONs, though, may be a different matter. You can actually have something that closely resembles a typical relation in JSON. Say, an array of objects that each contain some attributes, and maybe even other nested arrays.

Technically, you can also define a mapping from the JSON-based design to the classic relational design. For that you need to use things like LATERAL joins. It's going to be slow, but it would work. I'm not sure if it's possible in the other direction (converting a classic relational design into JSON with arrays of objects, possibly nested). Maybe with enough GROUP BY and string manipulation you could do that, but it's not useful.
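For example, in PostgreSQL an array of objects can be unfolded into rows with jsonb_array_elements in a LATERAL position (the documents table and the JSON shape are invented for illustration):

```sql
-- Assume documents(id integer, body jsonb), where body looks like
-- {"employees": [{"name": "Ann", "skills": ["SQL", "Haskell"]}, ...]}
SELECT d.id,
       e.emp ->> 'name' AS employee_name,
       s.skill
FROM documents AS d
CROSS JOIN LATERAL jsonb_array_elements(d.body -> 'employees') AS e(emp)
CROSS JOIN LATERAL jsonb_array_elements_text(e.emp -> 'skills') AS s(skill);
```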

Again, my theory is that the relational model was invented at a uniquely unlucky time: it happened before the complicated data types a la Haskell (and earlier, Miranda) were understood.


# The "Two-Column" Solution

Your mention of this reminds me of early objections to my attempts to normalize. Programmers would posture about how it would lead to a "binary data model" of a huge number of tables each with 2 attributes. Which wasn't true. Unless that's where the data definitions took us.

Of course, in the years since there has been a push toward key-value-pair databases but, hey, we put them all in one table instead, so that's not a problem. Except, of course, we lose the context of a proper data model, making it harder to ensure the semantic integrity of the data itself.

Anyway, in my experience, there are pretty much always more than 2 columns. There may be attributes related to the relationship being captured. With update anomalies removed, adding business controls such as capturing when the entry is created, and on whose authority, might be important. Also, we can't (immediately) delete business records, so there's also a starting timestamp (i.e., when INSERTed) and an ending timestamp (updated from "eternity" to the current timestamp for a "virtual" DELETE). (I spent most of my career working on financial systems, so we pretty much always had these.)

By the way, there will also be a separate PRIMARY KEY for each row. One of my (hopefully) future posts will argue that this is a **Business** identifier. And not a "system", "surrogate", whatever "artifice".
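A hedged sketch pulling those pieces together (the table and column names are my own, not a standard):

```sql
CREATE TABLE customer_account (
    entry_id     bigint      PRIMARY KEY,            -- identifier for each row
    customer_id  integer     NOT NULL,
    account_id   integer     NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(), -- when the entry was created
    created_by   text        NOT NULL,               -- on whose authority
    valid_from   timestamptz NOT NULL DEFAULT now(),
    valid_to     timestamptz NOT NULL DEFAULT 'infinity'
);

-- A "virtual" DELETE closes the validity interval instead of removing the row:
UPDATE customer_account
SET valid_to = now()
WHERE entry_id = 12345;
```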


My comments finish up being a bit longer than I'd anticipated, so I'm posting them in 4 parts.

My first thought when reading this is that you're struggling with the same thing - trying to make sense of the Normalization explanations - that I was attempting to cover in my 5NF post (https://datademythed.com/posts/5nf_missing_use_case/). You can't conceive how the data items can be defined to justify the "before" and "after" normal forms being used as examples. Except you're looking at Internet-based material while I was dealing with textbook treatments in the late 1970s through the 1980s. I hope you feel that is a fair assessment.

(My subsequent post (https://datademythed.com/posts/2nf_missing_use_case/) continues that discussion with respect to 2NF. Don't know if you noticed it. I only rely on folk using my RSS feed to know when I post an update.)

To be honest, I've never been able to grasp the whole "multivalued dependency" explanation. Particularly because it seems so ... well, wordy. But way, waayyy back I remember seeing something like the following as a 4NF definition (haven't been able to re-find it):

> Tuple `{k,a,b}` is 4NF if `f(k) -> a` and `f(k) -> b`.

So the *same* function determines the `a` and `b` values for a given `k`. Otherwise it needs to be split into `{k,a}` and `{k,b}`. I discuss an example of this in the 2NF post (https://datademythed.com/posts/2nf_missing_use_case/#4NF).

I don't know how correct this is, but it "clicked" - as in: made sense - for me. I spent my career with this as the basis of my 4NF understanding. And it served me very well.
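A rough SQL rendering of that split (the names are mine; in Kent's example `k` is the employee, `a` a skill, `b` a language): each independent fact gets its own table with a composite key.

```sql
-- One table per independent fact about k, each with a composite key:
CREATE TABLE k_a (
    k integer NOT NULL,
    a text    NOT NULL,
    PRIMARY KEY (k, a)
);

CREATE TABLE k_b (
    k integer NOT NULL,
    b text    NOT NULL,
    PRIMARY KEY (k, b)
);
```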

As a teaser before my next bits, re your 4 "Summary and Conclusion" points:

1. Yes and no.

2. Yes, and wrongly so. "It's never clear where the initial design came from" is at the root of the problem. Especially in something that is offered as an academic treatment.

3. "Multivalued" is a misdirection that only adds confusion.

4. 4NF is, indeed, rooted in the "traditional way" as you've described it. But I think it'd be better if people understood the formalism of why it is so.


I think this misses the point. 4NF is about whether the third column value is independent of the other two. For example, let's say the employee learns a new skill, "poetry": can they really teach it in all languages they know? Probably not, depending on their vocabulary and level in each language. Thus, to answer the question "Can this employee teach poetry in German?" we cannot have just two tables. To me, 4NF and all its "weird" examples are attempts to highlight this point: if the rest of the columns are independent of the choice of two particular columns, then one of those two columns can be moved to a different (2-column) table. Thus multivalued attributes are still something that needs to be considered before applying 4NF.
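A sketch of the kind of schema that captures this dependence (names invented): the teachable (skill, language) combinations are facts in their own right, so all three columns stay together under one composite key.

```sql
-- Which skills an employee can actually teach in which languages:
CREATE TABLE teachable (
    employee_id integer NOT NULL,
    skill       text    NOT NULL,
    language    text    NOT NULL,
    PRIMARY KEY (employee_id, skill, language)
);

-- "Can this employee teach poetry in German?"
SELECT EXISTS (
    SELECT 1
    FROM teachable
    WHERE employee_id = 42 AND skill = 'poetry' AND language = 'German'
);
```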


> I think this misses the point.

Sorry, this misses the point of what exactly? Of 4NF, of the way it was presented historically, or something else?

> For example, let's say the employee learns a new skill, "poetry": can they really teach it in all languages they know?

The article specifies the requirements upfront, and nowhere does it say anything about skill depending on language, so I assumed that they are independent.

The schema for "Can this employee teach poetry in German?" would be completely different, and 4NF would obviously not be relevant.

> 4NF and all its "weird" examples are attempts to highlight this point: if the rest of the columns are independent of the choice of two particular columns, then

The "if" part is given as "true".


As a retired Database Administrator and former programmer, I would like to suggest that you have just written the reason. The reason is to confuse people enough that they will hire consultants to be data modellers instead of using 3rd normal form, which they already understand. Math has a fourth dimension, which some see as time, so the consultants borrowed the words to make things shiny and new and more expensive.


Is it correct to summarize your general view as follows:

Back when RDBMSs were new, their proponents had to "un-teach" database administrators by defining the normal forms in terms of what they were NOT.

However, this pedagogy is unsuitable today. As students don't have these incorrect preconceptions, relational databases should be taught in terms of practical problems and how you would solve them with databases. (And the incorrect implementations, and the precise mathematical definition can be banished to an appendix.)


Yes.

> (And the incorrect implementations,

Incorrect implementations could well be taught in a well-separated "history" subsection. Following the discussion of my article, I started to greatly appreciate the pre-relational context (paper cards and all) and I think it deserves more exposure.

> and the precise mathematical definition can be banished to an appendix.

That depends on what the "precise mathematical definition" would be. Again, I am now more of the opinion that multi-valued attributes are just an inadequate foundation for link tables. A tuple-based approach with uniqueness seems to me much more straightforward.

"attribute" in "MVA" suggests that one side is somehow subordinate to the other side. In most cases it's hard to say whether a) list of developers is MVA of a project, or b) list of projects is MVA of developers. (And it's even not needed for anything!) Tuple-based approach removes this.

I am thinking about writing a follow-up post with some updates.
