20 Comments

Alexey, I absolutely agree with you. These traditional formulations are difficult to understand. The ability to explain complex things simply is very valuable.

---

# Presenting 4NF via non-4NF

Yes, we'd expect that an explanation of (n)th Normal Form (NF) would start with an instance in (n-1)th NF and explain why we need to make the additional transform. And yes, the examples from which we are supposed to gain this learning generally fail at this. Kent's '{Employee, Skill, Language}' is a particularly egregious example. His 4NF discussion has the definition:

> Consider employees, skills, and languages, where an employee may have several skills and several languages.

With that definition, we have '{Employee, {Skill}, {Language}}'. And it's a 1NF problem, with 2NF and 3NF solution '{Employee, Skill}' + '{Employee, Language}'. Trying to start with it as the basis for a 4NF discussion - which requires it to already be in 3NF, which it isn't - means all the subsequent discussion of variants relates to issues that wouldn't be there if it were properly 3NF in the first place.
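As a concrete illustration - a minimal sketch of that 3NF decomposition, with hypothetical table and column names (not Kent's):

```sql
-- '{Employee, Skill}': one row per skill an employee has.
CREATE TABLE employee_skill (
    employee_id integer NOT NULL,
    skill       text    NOT NULL,
    PRIMARY KEY (employee_id, skill)
);

-- '{Employee, Language}': one row per language, entirely independent.
CREATE TABLE employee_language (
    employee_id integer NOT NULL,
    language    text    NOT NULL,
    PRIMARY KEY (employee_id, language)
);
```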

So to the degree this contributed to your "I may be dumb" comment, please feel assured that it's not you.

As I read Kent's first paragraph of his "4.1 Fourth Normal Form" section, I made the marginal note to "remove MV" from his description, making it:

> ... a record type should not contain two or more independent ... facts about an entity ...

I think this has more clarity. After all, it would be odd/unlikely for "independent facts" to not *also* be multivalued, and including that word is a bit of a distracting tautology. And, I reckon, this simpler version ties back exactly to the understanding of 4NF that I earlier said I used.

(His 5NF example is also problematic in the same way as the examples I used in my discussion of textbook treatments of 5NF.)

That's it for my comments. At least for now? Hope you find them useful.

---

> I think this has more clarity. After all, it would be odd/unlikely for "independent facts" to not *also* be multivalued, and including that word is a bit of a distracting tautology.

This is a very interesting idea and I need to think about that, thank you!

> So the *same* function determines the `a` and `b` values for a given `k`.

And this one, from another comment too! Actually, I think that it may be useful to try to give an explicitly functional treatment of the relational model. Or rather, to use ideas from this post: https://graphicallinearalgebra.net/2015/09/08/21-functions-and-relations-diagrammatically/

---

> With that definition, we have '{Employee, {Skill}, {Language}}'. And it's a 1NF problem, with 2NF and 3NF solution '{Employee, Skill}' + '{Employee, Language}'.

Uhm, I think I'm not getting it here. I'm not sure what your braces notation means — does it show the primary keys?

{Employee, Skill} — how does it handle the case of multiple skills? Is the primary key composite here (PRIMARY KEY (EmployeeID, SkillID))?

How do you show in your notation a case when Employee has a LegalName?

---

Interesting question about the curly braces. Which I've not been asked before, nor thought about. On reflection, I've always used the curly braces of set theory, rather than the parentheses Codd uses for tuples. (He also prefixes a tuple with a name, which I don't do until I see what a normalized tuple actually represents.)

Can only guess it dates back to when I was taught normalization as part of my tertiary education. For perspective, it was after BCNF and before 4NF was defined. :-)

So by '{Employee, {Skill}, {Language}}' I mean there is a set of zero or more 'Employee' instance values. Each 'Employee' instance can have a set of zero or more 'Skill' instance values, and/or zero or more 'Language' instance values. 'Skill' and 'Language' are independent of each other.

An 'Employee' having a legal name would mean we start with '{_Employee_, LegalName, {Skill}, {Language}}'. I've added Candidate Key delineation around (I'm assuming) 'Employee' via the '_'s. (Don't know whether 'LegalName' could also be a candidate key, without asking you to further define it. :-) ) Which, after 1NF, becomes '{_Employee_, LegalName}', '{_Employee, Skill_}', and '{_Employee, Language_}'.
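Mapping that notation to SQL terms - a minimal sketch, with hypothetical names:

```sql
-- '{_Employee_, LegalName}': 'Employee' alone is the candidate key,
-- so LegalName stays with it in a single table.
CREATE TABLE employee (
    employee_id integer PRIMARY KEY,
    legal_name  text    NOT NULL
);

-- '{_Employee, Skill_}' / '{_Employee, Language_}': the whole tuple is
-- the key - the composite PRIMARY KEY (employee_id, skill) from your
-- earlier question - as in the link tables sketched further above.
```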

---

(I've missed your comments because Substack did not send me any notification, hmmm.)

Thank you, that notation actually makes a lot of sense; I'm going to re-read your texts on 5NF in light of this. And the underscore notation for primary keys is useful.

---

Please forgive a nitpick: I prefer Candidate Key(s) for identifiers in the logical model here. For me, the Primary/Alternate Key choice is a later decision, when implementing the model.

---

# "Multi-Values Attributes" Versus "Single Atomic Values"

Go back and re-read Codd's original paper. He wrote of the elements in a tuple as being *domain* values. This seems to have been completely overlooked for a long time. Wu's reference to "single atomic values" feels to me like an assumption that everything is a "simple" NUMBER or TEXT. And hence *not* what Codd discussed.

To me it was obvious from Codd's original paper that a good relational implementation therefore needs a powerful domain-definition mechanism. If I can define a domain whose instances are array/list values, then I can define a single-valued tuple element that happens to be a list of values. To the degree SQL has domain-definition capabilities, they are woefully inadequate and clumsy. And, presumably for those reasons, rarely used.

PostgreSQL includes more domain types, datatypes (https://www.postgresql.org/docs/current/datatype.html) in its parlance, than I ever had when using Oracle (up through 2012). Consider the PostgreSQL 'polygon' datatype (https://www.postgresql.org/docs/current/datatype-geometric.html#DATATYPE-POLYGON). This is a list of 'x,y' coordinate pairs with closure implied from the last to the first, and includes the area within.

If I did this in any version of Oracle I used, I'd have a 1NF transform to store each of the coordinate pairs separately and explicitly, including another value to sequence them. In PostgreSQL, I can simply declare an item of type 'polygon' because it supports single values within that domain.

But that's not the most important part. A domain must also come with operators to manipulate instances of it within the language. Which PostgreSQL includes (https://www.postgresql.org/docs/17/functions-geometry.html). In Oracle, I would have had to *also* write those. And they would have been callable functions, not operators like '+', '-', ... as PostgreSQL has done.
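For instance - a minimal sketch, assuming a hypothetical 'regions' table:

```sql
-- A polygon is a single domain value; no 1NF transform is needed.
CREATE TABLE regions (
    region_id integer PRIMARY KEY,
    boundary  polygon NOT NULL
);

INSERT INTO regions VALUES (1, '((0,0),(4,0),(4,4),(0,4))');

-- And the domain's operators come with it: '@>' tests containment.
SELECT region_id
FROM regions
WHERE boundary @> point '(2,2)';
```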

But generally we are stuck with using "simple" domains like NUMBER or VARCHAR. Or a few complex domains that are incredibly useful - like DATE, TIME, and TIMESTAMP - and included for our convenience. And probably a (<ugh>) BLOB-type thing. And now JSON (which, unfortunately, is ripe for abuse perhaps in part because of the poor support for domains more generally). One could argue that FLOAT is also a complex domain.

---

> He wrote of the elements in a tuple as being *domain* values.

I believe that this formulation is just one side of the coin, and the other side is that thou shalt not have relations as elements of relations.

I do not know how you could actually do that nowadays. I understand that back in the 1970s Codd had something that his model was an alternative to: hierarchical databases.

Btw, as an exercise, I was thinking about how we could actually build a database that is non-compliant with 1NF. I have two answers; one is purely algebraic: https://minimalmodeling.substack.com/p/an-algebraic-curiosity. Recently I was thinking about another:

We could put a SQLite database into a BLOB in a different database. We could actually code this, for example implementing it as a user-defined type in PostgreSQL. Then we would be able to write a query that queries the database inside that blob.
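A minimal sketch of that idea - 'sqlite_query' is the hypothetical user-defined function one would actually have to implement:

```sql
-- An entire SQLite database file stored as a single value.
CREATE TABLE nested_databases (
    id      integer PRIMARY KEY,
    db_blob bytea   NOT NULL
);

-- Hypothetical UDF (e.g. in C or PL/Python) that opens the blob as a
-- SQLite database and runs a query inside it:
--
--   SELECT t.name
--   FROM nested_databases,
--        LATERAL sqlite_query(db_blob, 'SELECT name FROM users') AS t(name);
```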

I think that it's important to think about that because, as we discussed, common explanations of 1NF are also not completely satisfactory.

Maybe I'm making things up, but I believe that I've read statements that NULL somehow contradicts the relational model (makes it non-1NF?). Again, for me NULL is just a trivial sum type, a la Maybe, and I don't see any problem with that. But apparently before Miranda, people had trouble imagining this.

---

If I had a domain type of "relation" and the operators to manipulate it in the algebra, then I could have a single value from that domain in a tuple. Though I can't, in practice, conceive of an upside, while seeing lots of downsides. It also heads us back down the path of the CODASYL/network models of old, which we saw had issues and have been trying to get away from. It could also take me back to the risks of using JSON data types.

Codd says that in his relational model, '(table_name, table_data)' is a tuple, each of 'table_name' and 'table_data' are domains, and a tuple instance has a value from each domain to be a valid relation. So in the example in your linked post, either you have a domain system that lets you define a "table_data" domain that encompasses each of the row values as a domain value instance within it, or it's not a relational data model.

In my experience, having a NULLable field along with NOT NULLs is certainly a 4NF violation *in the persisted data*. But another problem is the poor discourse around NULL.

I have no problems with NULLs in transient result sets: that's for the programmer to resolve as part of their code. They are defining the semantics by which the data is (trans)formed and consumed in the moment.

Also, too often the NULL discussion gets caught between NULL as in "not (yet?) known" and "not applicable". But "not applicable" is a different *and known* value/state, and might only be *represented* by NULL because there's no alternative. A special actual value to indicate it in a TEXT field *might* be possible. Other fields - numerics, dates/timestamps - are more problematic, and often NULL is used as a stand-in. Hopefully a well-documented one.
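To illustrate the 4NF point above - a minimal sketch, with hypothetical names:

```sql
-- A NULLable field mixed in with NOT NULLs...
CREATE TABLE employee (
    employee_id    integer PRIMARY KEY,
    legal_name     text NOT NULL,
    termination_dt date            -- NULL until (if ever) the employee leaves
);

-- ...versus moving the optional fact into its own all-NOT-NULL table:
CREATE TABLE employee_termination (
    employee_id    integer PRIMARY KEY REFERENCES employee,
    termination_dt date    NOT NULL
);
```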

---

> And they would have been callable functions, not operators like '+', '-', ... as PostgreSQL has done.

For me this is not an issue at all, again because of Haskell. In Haskell, operators are just functions with special syntax: you can write either

`2 + 3`

or, turning the operator into a function by putting it in parentheses,

`(+) 2 3`

So, for me "plus" is a function.

> And now JSON (which, unfortunately, is ripe for abuse perhaps in part because of the poor support for domains more generally).

Well, lol, it's almost on par with classic databases. I mean, of course it does not support even simple dates, but for a long time dates were basically the only non-basic type that many databases supported. Virtually no programming languages directly support dates either (except for the class-based implementations, of course).

For me the main problem with JSON is when you start putting objects into arrays. Everything else in JSON is basically syntax sugar over the normal relational model.

---

Thank you for your time!

> Wu's reference to "single atomic values" feels to me like an assumption that everything is a "simple" NUMBER or TEXT.

Since I started looking into that, I have had this understanding that a) 1NF requires that "cells" do NOT contain another relation; b) people in the past feared that if you made your data structure too complicated, you were in danger of introducing something that looks like a relation. So, out of an abundance of caution, somehow only the most trivial types were allowed.

That's why, for example, people have this strong reaction to columns that contain a string of comma-separated values. For me this design is just clumsy, because you can trivially redesign it in a much more useful way (id, value), but I don't even think that a comma-separated list of values is a relation. Both designs are isomorphic, so for me it's not a big deal. But I am not invested in this particular purity question; I would roll my eyes if I had to deal with CSV strings in SQL, and that's it. I don't think CSV is 1NF-violating.
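The two isomorphic designs, as a minimal sketch (hypothetical names):

```sql
-- Comma-separated values crammed into one column...
CREATE TABLE user_tags_csv (
    user_id integer PRIMARY KEY,
    tags    text    NOT NULL       -- e.g. 'admin,editor,beta'
);

-- ...versus the (id, value) design: one row per value.
CREATE TABLE user_tags (
    user_id integer NOT NULL,
    tag     text    NOT NULL,
    PRIMARY KEY (user_id, tag)
);
```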

Unrestricted JSONs, though, may be a different matter. You can actually have something that closely resembles a typical relation in JSON. Say, an array of objects that each contain some attributes, and maybe even other nested arrays.

Technically, you can also define a mapping from the JSON-based design to the classic relational design. For that you need to use things like LATERAL joins. It's going to be slow, but it would work. I'm not sure if it's possible in the other direction (converting a classic relational design into JSON with arrays of objects, possibly nested). Maybe with enough GROUP BY and string manipulation you could do that, but it's not useful.
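For example - a minimal PostgreSQL sketch, assuming a hypothetical 'employees' table with a jsonb 'skills' column holding an array of objects:

```sql
-- Unnest the JSON array of objects into relational rows.
SELECT e.employee_id,
       s.elem ->> 'skill'    AS skill,
       s.elem ->> 'language' AS language
FROM employees e
CROSS JOIN LATERAL jsonb_array_elements(e.skills) AS s(elem);
```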

Again, my theory is that the relational model was invented at a uniquely unlucky time: it happened before the complicated data types a la Haskell (and, earlier, Miranda) were understood.

---

> ... I don't think CSV is 1NF-violating ...

Depends how it's defined. But the Normalization discourse is completely devoid of any discussion of the issue of defining data. Hopefully you've noticed I talk of that often, including discussing how small shifts in definition change the Normalization outcome.

But what if we could define a CSV Domain, something that enforced its content's correctness as well-formed CSV? What about PDF, FLAC, or Markdown Domains for a Content Management System? (I abhor references to these as "unstructured data": they are *very* structured, with lots of useful metadata embedded within.)

More basically, why do I have to define a Foreign Key's data type? Why isn't it implicitly matched to the target Primary/Alternate Key's data type? Failing that, at least give me a syntax to do it, similar to PostgreSQL's PL/pgSQL and Oracle's PL/SQL having "<PKTable>.<PKCol>%TYPE", for example.
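PostgreSQL's CREATE DOMAIN does get partway there - a crude sketch, with a deliberately simplistic check:

```sql
-- A "CSV Domain": one line of non-empty, unquoted, comma-separated
-- fields. A real definition would need a proper grammar (quoting,
-- escaping, ...), but the principle stands.
CREATE DOMAIN csv_line AS text
    CHECK (VALUE ~ '^[^,]+(,[^,]+)*$');

CREATE TABLE import_staging (
    line_no integer  PRIMARY KEY,
    line    csv_line NOT NULL   -- well-formedness enforced by the domain
);
```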

The lack of good Domain support is a serious drag on good data model implementation.

---

# The "Two-Column" Solution

Your mention of this reminds me of early objections to my attempts to normalize. Programmers would posture about how it would lead to a "binary data model": a huge number of tables, each with 2 attributes. Which wasn't true. Unless that's where the data definitions took us.

Of course, in the years since, there has been a push to key-value pair databases. But, hey, we put them all in one table instead, so that's not a problem. Except, of course, we lose the context of a proper data model, making it harder to ensure the semantic integrity of the data itself.

Anyway, in my experience, there are pretty much always more than 2 columns. There may be attributes related to the relationship being captured. With update anomalies removed, adding business controls, such as capturing when the entry was created and on whose authority, might be important. Also, we can't (immediately) delete business records, so there's also a starting timestamp (i.e., when INSERTed) and an ending timestamp (updated from "eternity" to the current timestamp for a "virtual" DELETE). (I spent most of my career working on financial systems, so we pretty much always had these.)
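Something like this minimal sketch (hypothetical names, financial-systems flavour; it also anticipates the per-row key discussed next):

```sql
-- The "2-column" link table that, in practice, carries business controls too.
CREATE TABLE employee_skill (
    entry_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    employee_id integer     NOT NULL,
    skill       text        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    created_by  text        NOT NULL,                    -- on whose authority
    valid_from  timestamptz NOT NULL DEFAULT now(),
    valid_to    timestamptz NOT NULL DEFAULT 'infinity'  -- "eternity"
);

-- A "virtual" DELETE just closes the validity interval:
-- UPDATE employee_skill SET valid_to = now() WHERE entry_id = ...;
```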

By the way, there will also be a separate PRIMARY KEY for each row. One of my (hopefully) future posts will argue that this is a **Business** identifier. And not a "system", "surrogate", whatever "artifice".

---

My comments ended up being a bit longer than I'd anticipated, so I'm posting them in 4 parts.

My first thought when reading this is that you're struggling with the same thing - trying to make sense of the Normalization explanations - that I was attempting to cover in my 5NF post (https://datademythed.com/posts/5nf_missing_use_case/). You can't conceive how the data items can be defined to justify the "before" and "after" normal forms being used as examples. Except you're looking at Internet-based material while I was dealing with textbook treatments in the late 1970s through the 1980s. I hope you feel that is a fair assessment.

(My subsequent post (https://datademythed.com/posts/2nf_missing_use_case/) continues that discussion with respect to 2NF. Don't know if you noticed it. I only rely on folk using my RSS feed to know when I post an update.)

To be honest, I've never been able to grasp the whole "multivalued dependency" explanation. Particularly because it seems so ... well, wordy. But way, waayyy back I remember seeing something like the following as a 4NF definition (haven't been able to re-find it):

> Tuple {k,a,b} is 4NF if f(k)->a and f(k)->b.

So the *same* function determines the `a` and `b` values for a given `k`. Otherwise it needs to be split into `{k,a}` and `{k,b}`. I discuss an example of this in the 2NF post (https://datademythed.com/posts/2nf_missing_use_case/#4NF).

I don't know how correct this is, but it "clicked" - as in: made sense - for me. I spent my career with this as the basis of my 4NF understanding. And it served me very well.

As a teaser before my next bits, re your 4 "Summary and Conclusion" points:

1. Yes and no.

2. Yes, and wrongly so. "It's never clear where the initial design came from" is at the root of the problem. Especially in something that is offered as an academic treatment.

3. "Multivalued" is a misdirection that only adds confusion.

4. 4NF is, indeed, rooted in the "traditional way" as you've described it. But I think it'd be better if people understood the formalism of why it is so.

---

I think this misses the point. 4NF is about whether the third column's value is independent of the other two. For example, let's say the employee learns a new skill, "poetry": can they really teach it in all the languages they know? Probably not, depending on their vocabulary and level in each language. Thus, to answer the question "Can this employee teach poetry in German?" we cannot have just two tables. To me, 4NF and all its "weird" examples are attempts to highlight this point: if the rest of the columns are independent of the choice of two particular columns, then one of the said two columns can be moved to a different (2-column) table. Thus multivalued attributes are still something that needs to be considered before applying 4NF.

---

> I think this misses the point.

Sorry, this misses the point of what exactly? Of 4NF, of the way it was presented historically, or something else?

> For example, let's say the employee learns a new skill, "poetry": can they really teach it in all the languages they know?

The article specifies the requirements upfront, and nowhere does it say anything about skill depending on language, so I assumed it implies that they are independent.

The schema for "Can this employee teach poetry in German?" would be completely different, and 4NF would obviously not be relevant.
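(A minimal sketch of that "dependent" case as I read it, with hypothetical names: the fact itself becomes three-way, and there is nothing to decompose.)

```sql
-- When teaching ability genuinely depends on the (skill, language) pair:
CREATE TABLE teaching_ability (
    employee_id integer NOT NULL,
    skill       text    NOT NULL,
    language    text    NOT NULL,
    PRIMARY KEY (employee_id, skill, language)
);
-- "Can this employee teach poetry in German?" is then a single-row lookup.
```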

> 4NF and all its "weird" examples are attempts to highlight this point: if the rest of the columns are independent of the choice of two particular columns, then

The "if" part is given as "true".

---

As a retired Database Administrator and former programmer, I would like to suggest that you have just written the reason. The reason is to confuse people enough that they will hire consultants to be data modellers, instead of using 3rd normal form, which they already understand. Math has a fourth dimension, which some see as time, so the consultants borrowed the words to make things shiny and new and more expensive.

---

Is it correct to summarize your general view as follows:

Back when RDBMSs were new, their proponents had to "un-teach" database administrators by defining the normal forms in terms of what they were NOT.

However, this pedagogy is unsuitable today. As students don't have these incorrect preconceptions, relational databases should be taught in terms of practical problems and how you would solve them with databases. (And the incorrect implementations, and the precise mathematical definition can be banished to an appendix.)

---

Yes.

> (And the incorrect implementations,

Incorrect implementations could well be taught in a well-separated "history" subsection. Following the discussion of my article, I started to greatly appreciate the pre-relational context (paper cards and all) and I think it deserves more exposure.

> and the precise mathematical definition can be banished to an appendix.

That depends on what the "precise mathematical definition" would be. Again, I am now more of the opinion that multi-valued attributes are just an inadequate foundation for link tables. The tuple-based approach with uniqueness seems much more straightforward to me.

"attribute" in "MVA" suggests that one side is somehow subordinate to the other side. In most cases it's hard to say whether a) list of developers is MVA of a project, or b) list of projects is MVA of developers. (And it's even not needed for anything!) Tuple-based approach removes this.

I am thinking about writing a follow-up post with some updates.

---