(2/3) "Relational Database Management: A Status Report" [Chris Date, 1983]
Rewatching Chris Date's talk from 1983 today. Pt. II, Current systems.
This is part 2 of 3. Introduction and part 1: https://minimalmodeling.substack.com/p/watching-chris-date-1983-1.
A few months ago I watched a video from 1983: Chris Date “Relational Database Management: A Status Report”, a keynote speech given at the 10th Australian Computer Conference: https://youtu.be/watch?v=VnNbddUMZQI. Here are my comments on some parts of the talk that I found interesting.
Pt. II, Current systems
"I'm going to show you a list of some current products. All of the products in this list at least claim to be relational. I do not think they all are relational and I'll explain that point in a few minutes. The point is: the market perceives that relational is a good thing to be. And by the way, these are products. It does not include prototypes, these are all things you can go and pay good money for. And furthermore, it is very incomplete. Since this list was compiled you know I could add four or five times as many as I'm going to show you. But I think conservatively that there are at least 40 real relational systems now. That 40 includes a lot of systems from micros. If you're building a database system from a micro, you'd be insane if you didn't build a relational one, frankly. My list takes up two pages"
The number of systems is in the same ballpark as today.
I’ve been googling a lot while listening to this part of the talk. Here is an assortment of links to the companies and products that were mentioned.
By the way, SQL first appeared in 1974 (as SEQUEL). The ANSI/ISO "Database Language SQL" standard would be published in 1986, three years after the talk.
https://en.wikipedia.org/wiki/Actian (began as Relational Technology Inc. in 1980, later Ingres Corporation)
"Jumping to my next page, another system here, Oracle, has done the same thing. Oracle is another system that supports the SQL language. And what is impressive about Oracle is the number of different systems that it runs on. Originally it ran on PDP-11s and VAXes, then on the Data General Eclipse. It runs on IBM 370s and 370 look-alikes under VM. They have announced an MVS version. I don't know if it's been released yet. They also have a version that runs on the 68000 chip. And they originally designed their system to be very very portable. They say that it consists of about 2,000 modules written in the C programming language. If you don't know what the C programming language is, it is outgrowth from an earlier language that was called B. <laughter> That's all you need to know about C. It was designed to be portable. Of those 2,000 modules they claim that you only need to change 30 of them to move from one system to another".
C appeared in 1972, eleven years before this talk. But Date still finds it necessary to explain what it is. I would have thought that it was much more familiar at that time.
"So like it or not, SQL is going to become a very important language. It might become an actual standard and it almost certainly will become a de-facto standard."
"in fact I often describe SQL as the COBOL of relational database and that is meant both in good senses and bad"
Quipping about COBOL used to be as much fun 40 years ago as it is now. COBOL was designed in 1959, by the way, so it was already 24 years old.
Then Date talks about PC databases, mentioning https://en.wikipedia.org/wiki/DBase (1979) and https://en.wikipedia.org/wiki/R:Base (1981). Also, https://en.wikipedia.org/wiki/ADABAS (1971) is mentioned.
"It's impossible to keep up anyway. Every time you open Byte magazine, there's a new ad for a new system"
Ah, so magazines were the primary source of this kind of information.
"And one of the columns in the table [listing all of the available products] was architecture type. And the entries here were things like “hierarchical”, “CODASYL”, and so on. Now, here are some of the entries in that architecture type list. First, "inverted list relational-like". Now that confuses me right away, because inverted list is a physical structure and relational is a logical structure. When we talk about relational databases we mean: the user sees tables and the user manipulates tables at the logical level. Down at the physical level you can do whatever the system is smart enough to do, but those are two different things. "Relational-like" appeared many times. "Relational shared structures": I have no idea what that means. "Network with relational view", "hierarchical with relational capabilities", "quasi-relational". I really like that. You can see what happened, okay: you know, here's the guy compiling the table, he gets on the phone to the vendor, says "well, how would you describe your system?" "Well, it's sort of quasi-relational". You know, no deep understanding. And then there were some that really were relational."
This is an amazing historical insight into the evolution of this technology and the way some species mimic other species to increase the chances of survival.
"In this paper he [Codd] has this proposal: to qualify as relational a database management system must support tables, and nothing but tables. It's important to say what is not there as well as what is there, because otherwise a system like IMS for example could say it was tables, and therefore relational, you know. Because the segments in IMS which are like records, are like the rows and tables. So tables and nothing else at the logical level. Specifically, there are no user-visible navigation links between those tables. And then out of those eight operators of the relational algebra: at least select, project and join. Select, if you remember, picks out rows, project picks out columns, and join glues two tables together"
Here again it becomes clearer where the first normal form (1NF) comes from. “Segments in IMS”, what is that even.
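Date's minimal trio of operators — select picks rows, project picks columns, join glues tables — maps directly onto SQL. A small sketch using SQLite from Python, with made-up employee/department tables, just to show what each operator looks like in practice:

```python
import sqlite3

# Hypothetical tables to illustrate the three core operators Date lists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp  (emp_id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO emp  VALUES (1, 'Ada', 10), (5, 'Brian', 20), (93, 'Grace', 10);
    INSERT INTO dept VALUES (10, 'Research'), (20, 'Ops');
""")

# SELECT in the relational-algebra sense: pick out rows matching a predicate.
rows = conn.execute("SELECT * FROM emp WHERE dept_id = 10").fetchall()

# PROJECT: pick out columns (SQL's SELECT clause doubles as projection).
names = conn.execute("SELECT name FROM emp").fetchall()

# JOIN: glue two tables together on a common field.
joined = conn.execute("""
    SELECT e.name, d.dept_name
    FROM emp e JOIN dept d ON e.dept_id = d.dept_id
""").fetchall()
```

As Date says later in the talk, these three cover the bulk of everyday querying.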
"Specifically, the user should be able to join any table to any table wherever that makes sense. The user should not be constrained to only being able to join tables where there is some physical connection under the covers. Now, there are systems on the market today that support join, or say they support join, but the join only works if the database administrator has thought six months ago: ah, I expect people want to join these two tables together, I better physically connect them through some kind of index or pointer chain or something. If that's the situation you have, you're back in the bad old game with having programs dependent on the physical structure. If you need to change the physical structure, you have to change programs, and that's what we wanted to get away from. So these operators should be supported without requiring physical predefinition of access patterns."
I’ve already lamented above that unconstrained join is not a good idea. Here we can see where this relational rebellion probably comes from. Apparently, allowing the join used to be non-trivial: you had to use some sort of “pointer chain”, whatever that was. Also, most of us probably cannot really appreciate today how much programs used to “depend on the physical structure”, and how lucky we are that this is no longer the case.
"The question [from the audience] is: isn't it true that to join two tables you would have to have a common key in the two tables to do the join on, so you therefore had to think about at least that six months ago. You do have to have a common field. It doesn't have to be a key, by the way, it can be any field. But you do have to have a common field, but the argument is fields are part of the logical structure of the data. You thought about that when you did the logical database design and you separated the logical structure from the physical structure."
Again, reiterating that joins do not depend on keys.
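To make Date's answer concrete: the common field does not have to be a key on either side. A sketch with invented tables where `city` repeats in both tables, yet the join is perfectly legal:

```python
import sqlite3

# 'city' is not a key in either table (values repeat), but joining on it
# works fine. Tables and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices   (office TEXT, city TEXT);
    CREATE TABLE suppliers (supplier TEXT, city TEXT);
    INSERT INTO offices   VALUES ('HQ', 'Sydney'), ('Lab', 'Melbourne');
    INSERT INTO suppliers VALUES ('Acme', 'Sydney'), ('Globex', 'Sydney');
""")

# Every office paired with every supplier in the same city:
# 'HQ' matches both Sydney suppliers; 'Lab' matches none.
pairs = conn.execute("""
    SELECT o.office, s.supplier
    FROM offices o JOIN suppliers s ON o.city = s.city
""").fetchall()
```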
"Second point is: select, project and join are not the whole algebra, that's true. But they're the pieces of the algebra you need most of the time. I would guess 95 percent of all things you want to do, that you can do with the algebra you can actually do with just select, project and join. So that's the big useful subset."
This is interesting because it shows the ongoing bargaining over what relational technology really is. I think the way the technology evolved over the last couple of decades is that many restrictions on what counts as a relational database have been relaxed. For example, this lets us admit that MongoDB is a relational database.
"There is a myth that has been going around the commercial world in the database world for several years now, which is expressed in various forms. one is: oh yeah, relational is okay for query, but it's no good for real work. Operational databases, or production data, or transactional processing or whatever you want to call that. Another one you hear is relational systems require a hardware breakthrough, that is, in hardware associative memory, if they're ever going to perform acceptably, and so on and so on. Well, I think you can guess my opinion on that, but I'm going to skip over the file that says what it is. This is my position statement: I believe there is no intrinsic reason why a relational system should perform any worse than any other kind of system".
The “hardware associative memory” part seems to be an echo of some long-forgotten war. Looking it up on Wikipedia, we learn about https://en.wikipedia.org/wiki/Dudley_Allen_Buck. He invented a lot of technology and died an untimely death at the age of 32.
"Is the system smart enough to take high-level relational requests and convert them down into code on those indexes that is as good as hand code, as good as a clever programmer would write coming directly in at that level? Well, I'm going to quote the IBM systems here because they're the ones I know best. SQL/DS, and indeed DB2 also, take SQL statements as the user's requests and they compile them. By the way, that's important too: they compile them, not interpret them, and they produce code down at the indexing level that in many cases is as good as hand code. As a matter of fact, that's exactly what the optimizer is for. I mentioned the optimizer component you had to have; these systems have an optimizer, and this is exactly what they do: they try to generate code as good as hand code. In many cases they do it."
I think this explanation was necessary at the time because you needed to explain to people what happens when you take away physical access to the database file. Physical access could be hand-optimized, but the relational model seemed to remove this possibility entirely, so you might well worry about performance. Date explains that there is actually a compiler, just like in programming languages. I guess the successes of compiler optimization at the time made this parallel a favorable argument.
"I'll give you a trivial example of that. In the SQL language you can ask a question like: select all the information you have about employees, where the employee number is in... and then you list some employee numbers: 1, 5 and 93. So you're trying to find information about those three employees. Now probably the best way of implementing that is to use the employee number index, which is almost certainly there, and go directly to employee 1, directly to employee 5, directly to employee 93, and that's it. but what the optimizer does with that query today is, as soon as it sees that “in”-condition it says "oh my god, this is complicated" and does a sequential scan. But the point is that we know that that's a case where we don't do a very good job and we can fix it. We can fix it transparently, without affecting the user level at all, that's what this data independence is all about. You see, this optimizing technology is in its infancy, we're just starting."
Historical datapoint: “SELECT * FROM employees WHERE emp_id IN (1, 5, 93)” could not be optimized at that time. Something that we take for granted today.
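Indeed, running Date's exact query against SQLite today shows the optimizer has long since caught up: the query plan reports an index search on the primary key, not a sequential scan. (The table definition here is my own minimal reconstruction; the plan's text format is SQLite-specific and varies slightly between versions.)

```python
import sqlite3

# The query from the talk, run against a modern SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE emp_id IN (1, 5, 93)"
).fetchall()

# The last column of each plan row is a human-readable description;
# for this query it reports a SEARCH (index lookup), not a SCAN.
detail = plan[0][3]
```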
"So the program is joining A and B. Sometimes you need to do that even if you don't have join operations. Now, if you're joining tables A and B, there's fundamentally two things you can do. You can go through the A's and for each A find all of B's. Or you can go through the B and for each B find all the A's. Depending on the physical clustering of the data on the disk, depending on data distributions, depending on the relative sizes of tables A and B, on many other factors, one of those two approaches will outperform the other by orders of magnitude. If the application program at the record level chooses the wrong approach there is no way you can optimize it. In a SQL system you say: "join A and B", and the optimizer does the right thing, because it knows this information. And furthermore, the right solution today may not be the right solution a year from now. Because the disk may change, you may change the physical clustering, so you need to re-optimize. Re-optimizing in a relational system is trivial: you just recompile the statement and the optimizer now does the other right thing. Re-optimizing hand code is very hard. You've got to rewrite the program, and perhaps you don't. So there is a sense in which relational systems might ultimately perform better than non-relational ones, because the optimizer can have information available to it that is not available to the hand coder."
What Date explains here is the problem of optimal join ordering. This problem is as relevant today as ever, for several reasons. First, we now tend to split data across many different databases. They may even run the same database server implementation, but because the data lives on different servers, you can’t join them directly: you must do the join on the client side. Some companies disallow or discourage even trivial joins when the tables are known to be in the same database. With a normal JOIN, the optimizer has all the information it needs to pick the join order. If we’re forced to do a client-side join, for any reason, we have to consider this optimization ourselves, for each such composite query. Most of the time we probably just ignore the problem and it all works out, but sometimes we’re forced to think about it really hard, because otherwise our queries would be unacceptably slow.
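A client-side join is exactly the situation Date describes: the program, not the optimizer, must pick the strategy. A sketch of one reasonable default, a hash join that builds its hash table on the smaller input; all table names and data here are invented:

```python
# When two row sets come back from different servers, the client has to
# join them itself. Building a hash table on the smaller side and probing
# with the larger side avoids the quadratic nested loop.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build a hash table on one input, then probe it with the other."""
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

# Pretend these rows arrived from two different database servers.
users  = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Brian"}]
orders = [{"order_id": 7, "user_id": 1}, {"order_id": 8, "user_id": 1}]

# Build on the smaller side (users), probe with the larger (orders).
joined = hash_join(users, orders, "user_id", "user_id")
```

Choosing which side to build on is precisely the decision a server-side optimizer makes from its statistics; on the client, you have to know (or guess) the relative sizes yourself.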
Another way to restore the ability to do optimized joins is to put a copy of all the data into a single huge data storage, and call it data warehouse or data lake.
"I am NOT saying that a relational system *today* is going to perform like IMS *today*. If you have one particular application where you know precisely the kinds of transactions you want to run there's no question that you can build physical structures and tune the system in IMS or something like that today and you will get better performance than a relational system today. But I repeat, that is not an inherent situation and I think in a few years the relational people will catch up. After all you're comparing release 1 or 2 of a new technology with a system that's been out there for 15 or 20 years and has been tuned for those 15 or 20 years"
This was another strategic bet that turned out to be correct.
"I'm only talking about machine performance, not the overall performance of the DP (data-processing?) shop of course. And I talk about machine performance because it's always been regarded as the Achilles heel of relational systems, but if you look at the total installation performance and the people's performance... Well I'll tell you the first IBM customers on the SQL systems have come back and told us they're getting anywhere from five to twenty times improved productivity with the systems. And that was the point in the argument that machines are getting cheaper, people are getting more expensive. The economics of the 80s says we want to optimize people’s performance rather than machine performance."
This is a fascinating example of systems thinking, laid out very clearly.
I’ve split the text in three parts. The third and final part, “Possible future developments” will go out tomorrow.