The scope of the data modeling book
I’ve been thinking about the scope of the modeling book. What is there to write about? I think that the entire problem domain occupies roughly four layers:
Actual business domain: imagine that there are no computers and you keep track of the business using actual ledger books and registers of all kinds, trade catalogs, etc. What would be written in those books?
Logical model: anchors/attributes/links (or entity-relationship diagrams of all kinds). If we refactor our ecommerce website from Oracle to MongoDB, without changing anything else — the logical model would stay exactly the same. This is where Minimal Modeling primarily lives.
Physical model: how many tables and columns there are, how they are named, how boolean values are stored, what databases there are, etc., etc. Physical model is hard to refactor because you need to change the application code, and do a data migration. Physical model is usually straightforward to extend (add a new table or a column).
Below physical model. This is how the database server stores bytes: database file formats, writing and reading patterns, cache awareness, latency and throughput issues, types of storage and network devices, etc. This layer evolves on its own, even if the physical model doesn’t. This means that if we would have started the same business from scratch using the same logical model 5-10-20 years later, we would probably have chosen a somewhat different physical model, just because the access patterns and available optimizations have changed.
We are interested primarily in the Logical model layer. Because we want to be practical, we must discuss how the logical model maps to the physical model (how the tables would look like). But we can only give generic advice on how a sensible table structure would look, and discuss common pitfalls. Mind you, this advice is perfectly good and your system would work well. There are lots of details to dig into, if you demand a lot of optimization: performance, storage or organizational. But this is clearly out of scope of the book. In addition, it requires a lot of knowledge about your choice of database implementation, your storage, and data access patterns.
Scope outline
So, what are the main areas of our scope? Here is the draft outline I have so far:
Basic logical model;
Either/or/or data;
Polymorphic data;
Template repetitions;
Basic physical model;
Secondary data.
Basic logical model: Anchors, attributes, links
The first one is ordinary common logical models, straight from the Data Modeling 101 course. In the words of Chris Date:
There are only three databases in the world, you probably know that. One of them is "departments and employees", which is this one. One is "suppliers and parts", which you're going to see later, and the other one is "courses and students".
This topic is essential, almost everything else uses it in one way or the other.
Either/or/or data
This is where data dependencies are introduced. Example: imagine a service where you submit a photo, it is processed and you get a list of animals that are in this photo. There are three states of the uploaded photo:
Pending processing;
Processed; list of animals is stored in a separate table, ready to be queried by the user;
Error: e.g., broken or unsupported image format. Additionally, an error message is stored in a separate column to be returned to the user.
Here, “a list of animals” only makes sense in “processed” state, and the “error message” is valid only in “error” state. People commonly handle it in various ad-hoc ways, but systematic treatment is clearly needed.
Polymorphic data
Imagine a complicated e-commerce website that sells books, laptops and thousands of other types of items. Users want to search each type of item according to their specific properties. For example, you may want to search for hardcover books printed in the 1970s that have “astronomy” in the title. Or you want to find a laptop with 32Gb of memory and AMD processor. How do you model this?
On the other hand, all of those item types are somehow similar: you can buy a book and a laptop in a single order, and you will get an invoice that has two line items. Both books and laptops (and other types of items) have a price, title, inventory number and some other data that is common for all kinds of items. How do you model this?
You can see that modeling polymorphic data relies on modeling either/or/or data. “Number of pages” is only valid if the type of the item is a “book”, while “RAM size” is only valid for laptops.
Polymorphic data is different from either/or/or data because it requires thinking about what the item is (its identity). In the example from the previous section, the submitted photo is a simple thing, only its state changes. Two polymorphic data items can have arbitrarily different structure: different properties and other related data.
Template repetitions
(I’m not sure if that is the right term for that). Movie tickets, hotel rooms, flights, course assignments. There is nothing particularly special about this one, but people seem to get confused about that because of the way language works.
For example, let’s model an airline.
A specific modification of Airbus A340 has 250 passenger seats, with some specific seat layout.
We have three planes, Airbus A340 with 250 seats each. They are registered in our database of planes that the airline owns or rents.
We establish a daily flight from Lisbon to Oslo using one of those planes, and due to regulatory restrictions we can only fly up to 200 passengers on this route. Remaining seats stay empty, or could be used to transfer our crews or other non-commercial passengers.
For up to two months in the future, we have 200 tickets for sale on each daily flight. When you buy a ticket you are assigned a seat.
The word “seat” is used four times here. But those are four different concepts of the seat. The information about available seats is copied from 1/ the database of plane models to 2/ the database of our planes to 3/ the database of flights we operate to 4/ the database of tickets that we sell. But you have to precisely understand which of the “seats” you’re using in a sentence.
This is especially important during modeling, especially in a stressful situation.
Same thing happens with course assignments, hotel rooms, movie tickets, employee salaries and many other scenarios.
The approach that is explained in the data modeling book that I’m working on is especially beneficial for this common and important case of template repetitions Using carefully worded questions you can reduce the confusion between different kinds of seats, different kinds of course assignments and different kinds of salaries. Judging by the posts in some database modeling forums, this is a common source of confusion.
Basic physical model
Finally, we need to discuss the basic physical model. For the purposes of the book we settled on the classic relational approach to physical modeling: one table per anchor, one column per attribute. Links live in a separate table or in a column, depending on cardinality. Learning this approach lets you quickly get up to speed with implementing the actual system that you need to implement.
Other physical models
We briefly discuss other approaches to relational physical modeling such as entity-attribute-value. Also, we discuss a bit of JSON modeling.
Something also needs to be said about purposeful denormalization. Amazon DynamoDB is particularly interesting here. Also, designing star and snowflake schemas, and maybe even just flat tables.
Secondary data
One last big chapter that also needs to be discussed in the scope of the book is secondary data.
Suppose that you have a list of items bought by customers from different cities. If you have one thousand items sold, you will have one thousand rows in the “sales” table. For faster reporting, you may want to create a separate table with sales grouped by quarter, or by the region. You will need to keep this table up-to-date with regard to the primary “sales” table. This pre-aggregated table is a typical example of secondary data.
In other words, we want to use different storage for reading and for writing. Another word for that that you may have heard is CQRS (“command-query responsibility segregation”). We write to the “sales” table, but we read from a different, pre-aggregated table. The main reason to do that is to optimize performance. The price we pay is that we must keep the second table up-to-date.
This is a big topic, and it touches on a lot of questions that are outside of the book scope. But we need to discuss how the secondary data fits into our modeling approach, because basically every database needs to include secondary data.
I hope to have the second draft this week or next week, with several new chapters and updates based on the feedback on the first draft.