(1/3) “Relational Database Management: A Status Report” [Chris Date, 1983]
Rewatching Chris Date's talk from 1983 today. Introduction and Pt. I, “The relational model”
A few months ago I watched a video from 1983: Chris Date, “Relational Database Management: A Status Report”, a keynote speech given at the 10th Australian Computer Conference: https://youtu.be/watch?v=VnNbddUMZQI. (Somewhere on that page you can find my comment with video timestamps, a sort of table of contents.)
The video is 90 minutes long, and it’s so fascinating that I rewatched it to write down everything that caught my attention. I would highly recommend watching it in full: Chris Date is an amazing speaker. The talk is chock-full of technical details with a lot of historical interest, explained in a very accessible manner.
First, let’s look at the timescale. The video is from 1983, recorded almost 40 years ago. Chris Date was 42 at that time (personal note: I’m almost the same age now). He had 20 years of experience at that point, starting in 1962, including twelve years of experience with relational databases. Codd’s paper “A Relational Model of Data for Large Shared Data Banks” was published in 1970, so the whole idea of relational databases was just thirteen years old at that point. Ted Codd was 60 at the time.
Oracle was established in 1977; MySQL would be introduced in 1995. The Internet consisted of about 1000 nodes. TCP/IP already existed, but DNS had been introduced just a month before the talk.
The rest of the text consists of my comments on some parts of the talk that I found interesting. I’ve transcribed maybe half of the text, based on YouTube’s automatic transcription.
“It looks increasingly likely that by 1990 most people interacting with the database will be using a relational interface".
I was going to say that this prediction held up surprisingly well, but maybe it’s a bit self-recursive. Depending on what counts as a “database”, it looks like “most people” interact with data through interfaces built on search engines such as Google, and we may soon find that it’s some sort of LLM-based solution.
But certainly, my bet is that structured data that is well-suited for the relational model is going to keep being handled by solutions based on it.
"Certainly there is no shortage of systems in the marketplace that either are or claim to be relational"
Later he shows a list of 40+ such systems. The number is roughly similar to what we have now.
"three generations of relational systems"
You no longer hear “generations” being mentioned. Maybe this was exactly the moment when systems were finally converging to some sort of archetype, the conceptual development was basically done, and it made less sense to talk about generations. However, if you listen to some of the concerns they were dealing with at the time, the situation seems to have been very far from converged.
Pt. I, The relational model (timestamp 06:45)
"Relational databases became a very hot topic. If you look at the product announcements these days, almost every announcement you see in the database world is either for a new relational system or for a new relational frontend to one of the older systems."
"[after the show of hands] About 50% of the people in the room already know what a relational model is."
I wonder what that number would be today, in 2023.
"The old systems that were around in 1970 are still hard to understand. The basic problem is that there are no broad simple principles underlying the design of the system. Instead, there is just a broad set of special cases. The user does not have in his brain a simple conceptual picture of what's going on. He cannot predict with confidence what would happen in a given situation if he hasn't tried it before. He cannot extrapolate from what he knows to what he doesn't know."
This topic interests me so much because I want to better understand what the relational model was the alternative to. What was the baseline, what was the background, what were they trying to differentiate from?
It’s hard to imagine the nature of this problem: “cannot predict with confidence what would happen”. What did the day of a data engineer look like back then? What were the sources of friction, technical and organizational? How did your understanding converge, until you were experienced enough to be able to extrapolate?
"Logical and physical layers of the systems were not clearly separated". "The insert/update/delete rules are very-very complicated. For example, anyone here who knows IMS will know that that is true: the insert/update/delete rules there take pages and pages and pages of the manual to explain. And they're very-very confusing."
Again, it’s hard to imagine what was written on those “pages and pages and pages” describing something that is now an INSERT statement. Was the complexity in the structure of the data, or maybe in the rules for resolving concurrent updates?
We know that the ideas behind ACID (atomicity, consistency, isolation and durability) already existed around 1973, ten years before the talk: IBM IMS had supported ACID-style transactions since then, although the acronym itself was only coined in 1983 (Haerder, T.; Reuter, A., “Principles of transaction-oriented database recovery”). So it’s plausible that this concern was covered by those rules.
IMS is https://en.wikipedia.org/wiki/IBM_Information_Management_System.
Later in the talk we learn that before the relational model you had to think about data storage on a much lower level. If I understand correctly, you had to write the data using some sort of semi-structured data files, and you thought more in terms of bytes and file offsets(?).
The question is: what exactly did the relational model untangle? There were maybe half a dozen intertwined concerns. The new approach removed maybe a couple of those concerns from the tangle so that a lot of cognitive energy was freed up as a result.
The question we may ask today is: what’s the structure of the current tangle that we see in the data management industry and how can we reduce it as decisively as the relational model did back then?
"As a result of all of that, installing a system was a very slow job and it took very highly-specialized skills to do that. And when I say slow, I'm talking about periods of one year, 18 months, two years for complicated databases."
It would be interesting to know what the definition of “complicated” was: which metric would you look at? The word count of the business domain description?
Also, “two years” doesn’t seem like a very long time (I guess that “installing a system” is not what we think of today as “installing”).
"Because the logical and physical levels are not clearly separated, far too much application programmer effort was devoted to maintaining programs. That is, when a change was made in the physical structure of the database, which you had to do from time to time, for example for performance reasons or new requirements coming along, then typically you'd have to go back and change some of the application programs as well, because they were tied to that physical structure. And that is still true today in some systems. A figure of 50 percent of application programmer effort in some installations is quite realistic and that means that those people are changing old applications instead of building new ones. That's the famous application backlog problem."
Alright, we still have the backlog problem, now additionally encumbered by the invention of Jira (circa 2002). As for “they were tied to that physical structure”, I want to reiterate that this was more “physical” than what we now call “physical”. Here Date is probably talking about the actual bytes in some sort of file that you had to write very carefully. I’m not sure what happened if you miscalculated a physical write: did you clobber the database? How hard was it to restore from backups? What was the deployment and testing strategy?
"Spaghetti-style data structures" [Codd]
The term “spaghetti code” was introduced at the end of the 1970s, but this is the first time I’ve heard it applied to “data structures”. Relational modeling probably solved THAT part.
"The programmers had to know physical level as well as logical level".
"Furthermore, in mathematics if you have a problem which concerns n-ary relations (now, n-ary relations are things connecting to things connecting to things connecting to things), what you do is you break it down into an equivalent but simpler problem involving just binaries. Binaries are a canonical form, so binaries are attractive but, and this was Ted's real contribution, Ted said: if we could define operators, like join, which work directly on n-ary relations then two things would happen. The first thing would be that now you could process more information in one operation. Bigger chunks of information could be manipulated in single programming statements if you like. So people could be more productive"
This is where, I think, relational theory inadvertently influenced practice in a way that was not good. Relational theory works on tuples with an arbitrary number of elements, and this idea was transferred directly to the way tables are designed.
If we hadn’t forgotten that every n-ary relation could and should be decomposed into binary relations, we could more clearly see the logical structure inside the physical structure. If you deconstruct a table, you can extract several attributes, each represented by a dataset that contains two columns: an ID and the attribute value. This deconstruction can be completely virtual, imagined.
Of course it can also be physical: that’s what Anchor Modeling does: every table is designed to contain the minimal possible number of columns, say, two.
A lot of the informal introductory texts on the Internet still make unnecessary fuss around the sixth normal form (6NF), somehow claiming that it’s unusual, academic, impractical etc. This, I think, is the direct result of teaching the theory with n-tuples as a core concept.
You don’t need to keep the table physically in 6NF, but you can and should think about the relational data as a set of independent anchors (entities), attributes and links (relationships).
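To make this concrete, here is a tiny Python sketch (the data and column names are made up for illustration) of splitting a wide “employees” table into per-attribute binary relations, in the 6NF/Anchor-Modeling spirit, and then reassembling a row from them virtually:

```python
# Hypothetical wide table: one n-ary relation with several attributes.
employees = [
    {"emp_id": 1, "name": "Smith", "dept_id": 7, "salary": 30000},
    {"emp_id": 2, "name": "Jones", "dept_id": 7, "salary": 28000},
]

# One binary relation per attribute: primary key -> attribute value.
emp_name   = {row["emp_id"]: row["name"]    for row in employees}
emp_dept   = {row["emp_id"]: row["dept_id"] for row in employees}
emp_salary = {row["emp_id"]: row["salary"]  for row in employees}

# The wide row is just a (virtual) join of the binary relations on emp_id.
def employee(emp_id):
    return {
        "emp_id": emp_id,
        "name": emp_name[emp_id],
        "dept_id": emp_dept[emp_id],
        "salary": emp_salary[emp_id],
    }

assert employee(1) == employees[0]
```

The decomposition loses nothing: joining the binaries back on the ID reproduces the original n-ary row, which is why it can stay purely conceptual.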
"There are only three databases in the world, you probably know that. One of them is "departments and employees", which is this one. One is "suppliers and parts", which you're going to see later, and the other one is "courses and students". My contribution to this field is that I invented the "courses and the students" database."
"For reasons that do not concern us now, the columns of the table, which are like fields in the records, are called attributes in the formal relational theory."
The term “field” here seems to be an artifact of how the data was stored before the introduction of the relational model. This is part of what the First Normal Form (1NF) is about. Read more in: Making sense of 1NF, pt. II.
"Primary keys are a very familiar notion, while foreign keys are perhaps less familiar. A foreign key is a field in one table whose values match the values of the primary key in another table. I'll say that again: <slowly> the foreign key is a field in one table whose values match the values of the primary key in another table. Example: <...>"
You had to explain what a foreign key is to a technical audience in 1983. As we’ll see later in the talk, this notion was not as familiar as Date suggests: primary keys were not such an established concept at all. You might identify the primary keys in an existing system, but you did not necessarily design the data system starting from the concept of a primary key.
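As a minimal illustration of Date’s definition (the “departments and employees” data here is invented): the dept_id column of employees is a foreign key, so every value in it must match some value of the primary key in departments.

```python
# Invented "departments and employees" data.
departments = [
    {"dept_id": 7, "dept_name": "Research"},
    {"dept_id": 9, "dept_name": "Sales"},
]
employees = [
    {"emp_id": 1, "name": "Smith", "dept_id": 7},
    {"emp_id": 2, "name": "Jones", "dept_id": 9},
]

# Referential integrity: every foreign key value matches an existing primary key value.
department_keys = {d["dept_id"] for d in departments}
assert all(e["dept_id"] in department_keys for e in employees)
```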
"Domains. I don't have time to discuss domains in detail, I'll just tell you briefly that a domain is a conceptual pool of values. For example, in this "department and employee" database there is a pool of Department numbers, a conceptual set, and every value appearing in the department number column in this table and every value appearing in the department number column in *this* table, all of those values are values from that conceptual pool. But for the purposes of this presentation we can more or less forget about domains there. They're important but they're not as important as tables".
Ugh. This, in my opinion, was an unfortunate accident of timing in the evolution of relational technology. The “domain” is basically an algebraic data type. Relational systems had simple data types: you had numbers (for prices or item counts); you had strings (for names and addresses); you had dates.
But primary keys were mostly undifferentiated integers, and this, I think, could have been handled in a much more useful way just a few years later. Haskell appeared in 1990. Miranda (a programming language that was a predecessor of Haskell) appeared in 1985. They helped bring the idea of algebraic types into the research mainstream. In particular, you could define that a user ID of 10 is a different number 10 than a product ID of 10.
Type systems are a complicated topic. They tend to expand until you no longer understand how to make the typechecker happy. If you let type people roam freely, they’re going to completely hijack your system, making the cognitive load unbearable.
But if only we had introduced a slightly stricter definition of what a primary key is: that there are user IDs and there are product IDs and you can’t compare the two. That way your join would refuse to run if you accidentally used the wrong ID in your query. We would also have had a complete list of entities (anchors) from the start, and this would have enormously improved the understanding of any database schema.
This, of course, would have been hard to do for at least one more reason: the insistence that the join MUST allow joining basically any pair of columns, no matter what they are (more on that later in the talk). Were it stricter, allowing joins only on IDs by default, I really think that we would have a better data industry.
This is, of course, utter speculation on my part. But if I were introducing people to modern data technology, this would be one of the things that I’d insist on: both in schema design and in data querying.
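Just as a thought experiment, here is what that stricter, typed notion of a primary key could look like in Python. The names (UserId, ProductId, join_on_id) are invented, and this is only a sketch of the idea, not anything an actual system did:

```python
# Speculative sketch only; all names here are made up for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class UserId:
    value: int

@dataclass(frozen=True)
class ProductId:
    value: int

def join_on_id(left, right, key):
    """Join two lists of rows on `key`, refusing to compare different ID types."""
    out = []
    for l in left:
        for r in right:
            if type(l[key]) is not type(r[key]):
                raise TypeError(f"refusing to join {type(l[key]).__name__} with {type(r[key]).__name__}")
            if l[key] == r[key]:
                out.append({**l, **r})
    return out

users  = [{"user_id": UserId(10), "name": "Ann"}]
orders = [{"user_id": ProductId(10), "total": 99}]  # wrong kind of ID, by mistake

# join_on_id(users, orders, "user_id")  # raises TypeError instead of silently matching 10 == 10
```

The point is the last line: a join restricted to IDs of the same type fails loudly on a schema misunderstanding, instead of producing a plausible-looking but wrong result.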
"Now, any given database will have its own specific rules as well. For example, the "department and employee" database would have a rule that says, let's say, "salaries must be positive", but that is local and specific for that database".
The keyword here is “local”. See also: “On the uselessness of CHECK constraint”: https://minimalmodeling.substack.com/p/persisting-eitheror-values-or-the. More on constraints: https://minimalmodeling.substack.com/p/database-constraints-pt-i
"I could justify that rule [entity integrity] but it would take me several minutes to do, so for the purposes of this presentation just please believe primary keys must not be null"
In 1983, you still had to remind people that primary keys must not be null.
"Any foreign key value must match an existing value of the corresponding primary key. If you have Department 7 here you better have Department 7 there. Unless the foreign key is null. It might be possible under some circumstances for foreign keys to be null. For example if it is possible in this company for an employee to be in no department at all, at a certain time, and that's possible in IBM, because it happened to me. <laughter> Don't inquire too closely... My personnel record then in the personnel file would have a null Department number."
That’s an amusing historical anecdote. It’s good to have an important stakeholder who personally depends on one of the requirements of your system.
"And the last two which are the most complicated (but they're still not very complicated). First, join. Join takes two tables which have a common column. For example, these two tables both have a column of B's and you can join the tables together on that common column on the basis of matching values".
We’ll see that again. Historically, joins have been completely unrestricted, and in retrospect that was not a good thing.
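Here is a minimal sketch of the quoted definition, with invented values: two relations sharing a column B, joined on the basis of matching B values.

```python
# Invented values: r has columns (A, B), s has columns (B, C).
r = [("a1", "b1"), ("a2", "b2")]
s = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]

# Join on the common column B, keeping rows where the B values match.
joined = [(a, b, c) for (a, b) in r for (b2, c) in s if b == b2]
# joined == [("a1", "b1", "c1"), ("a1", "b1", "c2")]
```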
"Finally, divide. Divide is the most complicated of these operators. And fortunately, it's the one you need least. I'm just going to explain it in its simplest possible form."
What follows is the explanation of the divide operator from relational algebra. I couldn’t follow that one, but I found a nice hint towards understanding: "An intuitive property of the division operator of the relational algebra is simply that it is the inverse of the cartesian product." https://stackoverflow.com/a/34979122/7754. Now I need to listen to the explanation again, maybe I’ll understand it this time.
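My attempt at a small Python sketch of relational division, following that “inverse of the Cartesian product” hint (the supplier/part data is made up): R(A, B) divided by S(B) returns every A value that is paired in R with every B value appearing in S.

```python
# Made-up data: r pairs suppliers with the parts they supply; s lists parts of interest.
r = {("s1", "p1"), ("s1", "p2"), ("s2", "p1")}
s = {("p1",), ("p2",)}

def divide(dividend, divisor):
    """Return every A value paired in `dividend` with every B value in `divisor`."""
    a_values = {a for (a, _) in dividend}
    b_values = {b for (b,) in divisor}
    return {a for a in a_values if all((a, b) in dividend for b in b_values)}

print(divide(r, s))  # {'s1'}: only s1 supplies every part listed in s
# "Inverse of the Cartesian product": the product divide(r, s) x s is contained in r.
```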
“Join in particular is such a useful operation in practice that systems ought to support it directly even though it is not a primitive operation"
What Chris Date calls a “join” we now call an “inner join”. Then he proceeds to explain what a (left) outer join is.
On a personal note, I’m an adept of the Church of Left Join: https://minimalmodeling.substack.com/p/many-faces-of-left-join (there will be more discussion of joins later).
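For contrast, here is a tiny sketch (invented data) of the inner join Date describes versus a left outer join, which keeps the unmatched rows of the left table and fills the missing side with None:

```python
# Invented data: Jones has a null department, so the inner join drops him.
employees   = [("Smith", 7), ("Jones", None)]   # (name, dept_id)
departments = [(7, "Research")]                  # (dept_id, dept_name)

inner = [(n, d, dn) for (n, d) in employees for (d2, dn) in departments if d == d2]

left = []
for (n, d) in employees:
    matches = [dn for (d2, dn) in departments if d == d2]
    if matches:
        left.extend((n, d, dn) for dn in matches)
    else:
        left.append((n, d, None))

# inner == [("Smith", 7, "Research")]
# left  == [("Smith", 7, "Research"), ("Jones", None, None)]
```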
I’ve split the text into three parts. The second part, “Current systems”, will go out tomorrow.