(3/3) "Relational Database Management: A Status Report" [Chris Date, 1983]
Rewatching Chris Date's talk from 1983 today. Pt. III. Possible future developments.
A few months ago I watched a video from 1983: Chris Date “Relational Database Management: A Status Report”, a keynote speech given at the 10th Australian Computer Conference: https://youtu.be/watch?v=VnNbddUMZQI. Here are my comments on some parts of the talk that I found interesting.
This is part 3 of 3. Previously:
Introduction and Pt. I: The Relational Model: https://minimalmodeling.substack.com/p/watching-chris-date-1983-1
Pt. II: Current Systems: https://minimalmodeling.substack.com/p/watching-chris-date-1983-2
Pt. III. Possible future developments
"In terms of primary keys now the situation in systems today is that most systems do not support primary keys. For example, in the IBM systems what you have is you have the ability to force a field to be unique, and you have the ability to force a field not to be null. So you can impose that discipline for yourself, but the system doesn't know what you've done and it turns out there are many places where it needs to know what you've done"
Looks like this fundamental part of the technology is still oscillating between “very familiar” and “not supported”.
"And then you name the primary key and the system understands that and it enforces the uniqueness without me having to say so and it enforces the not-null mess."
What’s interesting here is “not-null mess”. I mean, it’s not clear to us now where the mess was. Primary keys are not null, full stop. I guess that this is an echo of the entire NULL debate (also something that would have benefitted from the algebraic data types insight).
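For what it's worth, this is exactly what we got: declaring a primary key today makes the system itself enforce both properties. A tiny SQLite sketch (table name invented):

```python
import sqlite3

# What Date is asking for: once you declare a PRIMARY KEY, the system itself
# enforces uniqueness and not-null-ness, with no application code needed.
# (Note the SQLite quirk: a non-INTEGER primary key allows NULLs unless you
# also say NOT NULL -- its own small echo of "the system doesn't know what
# you've done".)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute("INSERT INTO employees VALUES ('E1', 'Alice')")

for bad_row in [("E1", "Bob"),      # duplicate primary key
                (None, "Carol")]:   # NULL primary key
    try:
        conn.execute("INSERT INTO employees VALUES (?, ?)", bad_row)
        print("accepted:", bad_row)
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
```

Both bad rows are rejected by the system, which is precisely the "discipline" Date wanted moved out of user code.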
"I will admit frankly that the biggest deficiency in relational systems today is lack of support for foreign keys. Because you see, if the system will not maintain this integrity for you, who gets to do it? Answer: the user. That means you have to write procedural code to do this stuff yourself. I just hate to think of every installation, for every database for every foreign key in the database writing a bunch of code to do this stuff, when really the system should do it. And as a matter of fact some systems are beginning to do this but nobody as far as I know does 100% job and I would like to see that as soon as possible."
This part of the debate came full circle, I think. Apparently, it was very important for them to insist on maintaining integrity as part of the definition of a relational system. (He says more about this later in the talk, but I did not transcribe that part.) Basically, the main concern is: what do you do if you want to delete a record whose primary key is referred to by some foreign key, or to change the primary key of that record? The possibilities are well known: a) forbid the operation; b) cascade the delete/update to the rows in every table that has foreign keys referring to that primary key; c) update the referring rows, setting the foreign key to NULL.
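These options are exactly the referential actions that made it into SQL. A minimal SQLite sketch (table and column names invented) showing the set-to-NULL behavior; the comments note the other two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only checks foreign keys if you ask it to -- itself a nice period detail
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE vendors (vendor_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE items (
        item_id   INTEGER PRIMARY KEY,
        vendor_id INTEGER REFERENCES vendors(vendor_id) ON DELETE SET NULL
    )""")
conn.execute("INSERT INTO vendors VALUES (1)")
conn.execute("INSERT INTO items VALUES (10, 1)")

# Delete the parent row: the child's foreign key becomes NULL (option c).
conn.execute("DELETE FROM vendors WHERE vendor_id = 1")
print(conn.execute("SELECT vendor_id FROM items WHERE item_id = 10").fetchone())
# → (None,)
# With ON DELETE RESTRICT the delete would be forbidden (option a);
# with ON DELETE CASCADE the item row would be deleted too (option b).
```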
But this is only possible if all the tables live in a single database, without exception. When the data could be split into multiple databases, you lose the ability to do this coordination in practice (due to the CAP theorem, yadda-yadda-yadda).
In today’s systems we have basically abandoned the idea of maintaining this integrity in the original sense of 40 years ago. The basic integrity stays, of course; that’s obvious. But we just don’t do those problematic operations. Nobody really changes the value of a primary key (why would you?); you build your queries so that they are more tolerant of dangling references; you cross-check your databases from time to time, looking for dangling references; and you have a plan for what to do when you find them.
Of course, in some older databases you may still have the luxury of actual old-school referential integrity, with primary key lookups on each insert/update/delete, but you must be prepared to abandon it if/when the system no longer handles the load and needs to be split up.
"the question is: adding a field if you ever feel that is a foreign key, does that mean you have to change programs? and that unfortunately requires a lot of discussion I simply don't have time. I'm sorry, but I'll talk to you in the break."
This question sounds a bit odd now, maybe it’s another remnant of this “physical access” mentality?
Then he mentions a couple of things. First, outer joins are a big deal and have just started to appear in some vendors’ products (IBM does not have them, and their customers demand them). Second, updatable views require primary keys (which was not a given at that time), and join-based views require foreign keys (which was even rarer). One thing I’d like to note is that the whole idea of updatable views seems to have faded away almost completely. Classic relational database servers support them, of course, but AFAIK none of the newer solutions do, and views are rarely used.
More on views:
"Situation today is: some systems don't support views at all, some systems support views but they're read-only, some systems support views with update but the update rules are very ad-hoc."
What’s interesting here is that everybody apparently thought that views were a good idea, but was a bit fuzzy about what they actually were.
Personally, I’m happy that views were felt to be necessary, because views show that you shouldn’t pay too much attention to the physical schema: you can always rewrite it (maybe even in your head) as a set of equivalent views. For example, any storage could be represented as pure 3NF or pure 6NF, even if the data is actually stored in JSON or whatever. Views seem to have fulfilled the need for a more logical layer.
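As a toy illustration of views as a logical layer (all names invented): the physical table below is a denormalized event log, but a view presents the clean "users" table that is only implicit in it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events_raw (
        event_id  INTEGER PRIMARY KEY,
        user_id   INTEGER,
        user_name TEXT,     -- denormalized: repeated on every event row
        payload   TEXT
    )""")
conn.executemany(
    "INSERT INTO events_raw VALUES (?, ?, ?, ?)",
    [(1, 7, "alice", "login"), (2, 7, "alice", "logout"), (3, 8, "bob", "login")],
)

# The "logical" users table is just a view over the physical storage.
conn.execute("""
    CREATE VIEW users AS
    SELECT DISTINCT user_id, user_name FROM events_raw
""")
print(conn.execute("SELECT * FROM users ORDER BY user_id").fetchall())
# → [(7, 'alice'), (8, 'bob')]
```

Queries against `users` don't need to know, or care, how the data is physically laid out.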
Recursive SQL was also mentioned. CONNECT BY was introduced by Oracle "in the 1980s": https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL#CONNECT_BY. I thought it happened much later. I remember reading about Oracle’s implementation in 2002 or 2003, and it felt like something otherworldly and amazing.
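For comparison, the form that eventually became standard (the SQL:1999 recursive CTE, supported today by SQLite and most other systems) looks like this; the org-chart data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, boss INTEGER, name TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [(1, None, "ceo"), (2, 1, "vp"), (3, 2, "dev")])

# Walk the management chain from the root downward.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM emp WHERE boss IS NULL
        UNION ALL
        SELECT emp.id, emp.name, chain.depth + 1
        FROM emp JOIN chain ON emp.boss = chain.id
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()
print(rows)  # → [('ceo', 0), ('vp', 1), ('dev', 2)]
```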
"Of course, if you're in COBOL and... well... serves you right. (I didn't say that, erase that from the tape.)"
COBOL jokes are timeless, you can probably find them even on cuneiform tablets.
"I expect to see PhD theses coming out of universities over the next several years on new clever ways to optimize relational expressions. And that stuff can then be folded into products. I think we are today in optimizing relational expressions where the programming languages people were, say, 15 years ago in optimizing arithmetic expressions, you know. And they're very very good at it now and I think we will get good in a few years."
"I have a somewhat unorthodox opinion in this area. I happen to think that a physical design aid is very very important but a logical design aid is not. Let me explain. This optimizer in the optimizing systems like DB2, what it does is it takes the request and it looks around the database to see what indexes there are and it figures out which indexes to use. Great. Question: where did the indexes come from? Answer: some human being put them there, the database administrator. Question: how does the database administrator know which indexes to put there? Answer: either he guesses or he has to know how the optimizer works, and neither of those is a good answer. And the whole point about optimization is it should be automatic and transparent, as a black box, as magic, you know, it just works.
What we need badly is some mechanical aid to tell the database administrator where there should be indexes. And in IBM Research such a tool has been built. The input to the tool is the set of queries that you typically want to ask the system with their relative frequencies. And then the design aid goes and talks to the optimizer and says: if you had these indexes what code would you generate? And it talks to the optimizer and figures out the best mix of indexes and comes back and then tells the database administrator. It is still a human decision as to whether you want the indexes, but that's how it works. And what is interesting about this is that this is a problem that is only going to get worse. See, right now at the physical level of these systems there aren't too many options. So physical database design is not too hard, but as more and more options are put in at the physical level, the new structures like hashing and so on, that is going to get more difficult and the requirement for this is going to increase."
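A crude modern analogue of "talking to the optimizer" that anyone can do today: ask the system for its query plan before and after creating a candidate index. A SQLite sketch with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item_id INTEGER)")

def plan(sql):
    # EXPLAIN QUERY PLAN returns rows whose last column describes each step
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE item_id = 42"
print(plan(q))  # before: a full scan of the table
conn.execute("CREATE INDEX idx_orders_item ON orders(item_id)")
print(plan(q))  # after: the optimizer picks idx_orders_item
```

The IBM Research tool Date describes essentially automated this loop over a whole query workload, which is also roughly what today's commercial index advisors do.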
As for the physical design aid (or, simply, the thing that suggests indexes): twenty-four years later, in 2007, this prediction became relevant once again, when database cracking (also known as adaptive indexing) was invented.
The “should be automatic and transparent, as a black box, as magic, you know, it just works” statement is interesting. You would probably expect people to want more control over their resources, such as disk space. Only relatively recently have we accepted the idea of fully automatic space allocation in the name of search efficiency and reduced administrative work. Before that, you were expected to create every single index yourself. You wanted the opposite of a black box, even if you were not really qualified for that kind of control.
"As for the logical design aid. Well, here I'm very skeptical. I've never seen a logical design where it wasn't easier just to do the design than it was to learn how to use the design aid. The point is that the input to the design aid has to come out of the designer's head and if the designer has all that information usually he could just do the design"
This is super interesting, and it’s not clear what those people really wanted. Nowadays you can imagine feeding a database schema to ChatGPT and receiving a draft logical schema description, maybe even a draft of a Minimal Model definition?
But what did they want to use as input, and what would have been the output?
"The relational model is good for centralized databases. It is essential for distributed databases. All the distributed systems I know, the prototypes that is, that are being built — all use the relational model as a basis. And there are many reasons why that is so. One of them is this optimizing. Again, it's very important in a distributed system to choose the right strategy"
CAP Theorem: 1998.
"New kinds of data. The relational model is very good at handling character strings and numbers and dates and times in some systems. And an interesting question is: can we extend the model to handle new kinds of data, like text and images and icons, things like that, retaining the conciseness and precision and elegance of the model."
Wikipedia says that database BLOBs were invented by Jim Starkey around that time, in the 1970s-1980s.
But I guess that Date is talking not only about binary data, but also about things like full-text search (I’m not sure what the distinction was then between images and icons).
"Before ever I was in databases, I was in programming languages and I regard it as rather sad that many of the things the database people have discovered, languages people already knew. There really has been sort of parallel development with not too much cross-fertilization. I would like to see some; it is beginning to happen."
Yes. NULLs, domains, and some database validation questions (such as handling of either/or data) would be in a better state today if relational database research and functional programming research had met at a critical point in the timeline.
"For example, an employee is represented by a row in the employee table. That's not true, that statement is inoperative at this time. Typically, an employee is represented by one row in the employee table, and ten rows in the employee_benefits table, and three rows in employee_dependents table, and 27 rows in the employee_job_history table, and so on, and so on. All of these things together represent the real-world *entity* if you like, and it would be nice if the system understood that all of those things were related. Foreign keys are important here, and there has been some work on elevating the model."
This point of view probably led us to the invention of ORMs.
More interestingly, if you have some sort of anchor/attribute/link logical model, this “real-world entity” is easy to assemble: given a certain ID value, you fetch all the attributes, and also fetch all the links from or to that entity. There is a counterargument, however. Suppose that you have a table of items that you sell, and each item is sold thousands or even hundreds of thousands of times. If you take this naive view of the “real-world entity”, then you would include all the records about this item being sold. The item name, for example, is obviously a part of this “composite entity”, and a link to the vendor is too. But the list of orders that include this item doesn’t look right as part of the item’s description. This is interesting because it illustrates our idea that entities, links and attributes are independent and concatenative (more on that later).
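A sketch of this naive entity assembly (all table names and the table registry are invented; a real system could derive the registry from declared foreign keys, which is exactly why Date says "foreign keys are important here"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee_benefits (emp_id INTEGER, benefit TEXT);
    CREATE TABLE employee_dependents (emp_id INTEGER, dependent TEXT);
    INSERT INTO employee VALUES (1, 'Alice');
    INSERT INTO employee_benefits VALUES (1, 'dental'), (1, 'vision');
    INSERT INTO employee_dependents VALUES (1, 'Bob');
""")

# The registry of tables that carry pieces of the "employee" entity.
ENTITY_TABLES = ["employee", "employee_benefits", "employee_dependents"]

def assemble(emp_id):
    # Pull the rows for this ID from every registered table.
    return {
        t: conn.execute(f"SELECT * FROM {t} WHERE emp_id = ?", (emp_id,)).fetchall()
        for t in ENTITY_TABLES
    }

print(assemble(1))
```

The counterargument above is visible right in the registry: whether `orders` belongs in `ENTITY_TABLES` for an item is a judgment call, not something the schema decides for you.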
"Ted Codd, again, has a thing called RM/T — an extended relational model, which (as a matter of fact) he first presented in Australia about three years ago. It has not been implemented, and it is still in somewhat of a ragged form."
This was new to me: https://en.wikipedia.org/wiki/Relational_Model/Tasmania
This was an interesting journey into the earlier days of relational technology. Thank you for reading! I had a bit of a hiatus with this substack, but I hope to return to a regular cadence. There are several unfinished topics from the previous series in my drafts.