Why attributes should live outside the anchor
In Minimal Modeling, anchor is NOT a container for attributes. This is one of the most important ideas, and it goes somewhat against the common thought. In today’s post I’ll talk about the driving forces and the consequences of this decision.
Tables and columns
First, where does the common thought come from?
In ordinary relational modeling, you create a table and this table contains a number of columns. For example, we have a users
table that contains user’s ID, login name and real name:
CREATE TABLE users (
id INTEGER PRIMARY KEY NOT NULL,
loginname VARCHAR(32) NOT NULL,
real_name VARCHAR(255) NOT NULL
);
It’s clear that the users
table is a container for the attributes (represented as columns). When we need to add one more attribute, we’ll try to add another field to this table, say “year_of_birth INTEGER
”. This works well, until it doesn’t. For example, we may feel that there are already “too many” columns in a table. Or, the attribute we need to add is somehow special, for example, some kind of private information about the user, and we don’t want to keep it in the same table (or even in the same database) as other, less private attributes.
We call this “Game of Tables”, by the way. The goal of this game is simple: you just need to decide where to put yet another attribute.
Anyway, the point here is that if you started with relational modeling (as many did), then it feels natural to think that attributes somehow live inside their entity (or noun, or anchor), and that’s why people do that. But this way of thinking may quickly become a liability.
Object-relational mapping (ORM)
Second, how does the familiar approach break?
When we write code that uses our database, we often use some sort of object-relational mapping library (ORM). Very often it is modeled after relational modeling, and it closely reflects the concepts of tables and columns.
So, we would probably have the following pseudocode definition of the User class:
class User {
define_id id : integer;
define_attribute loginname : string;
define_attribute real_name : string;
}
So that we may use this class like this:
var user = User.find(2);
print user.real_name;
user.set_attribute(loginname: “jake”);
Etc., etc. What we see from the definition of a User
class is that it needs to change when a new attribute is added. At some point, this becomes inconvenient: for example, what if we added some highly sensitive attribute? Suppose that we need to add an attribute that contains a scanned copy of a user’s passport. We don’t want it to be accessible so easily:
write_to_file(“/tmp/passport.jpg”, user.passport_image);
We will surely want to create another class that manages such sensitive attributes. (Or maybe just this particular attribute?) This is actually a variant of Game of Tables all over again, we can just call it Game of Classes. Where do we put the definition of an attribute? Does our code change if we want to move an attribute to a different class? (Of course it does change!)
So, what’s the problem here in a nutshell? The problem is that when you add a new attribute to the definition of User class, you have to recompile all the code that uses any other attribute defined in the User class. So, we have a false dependency between unrelated parts of code.
Another practical issue that arises when we use a typical ORM-inspired approach is how much data we load. In the code above, we loaded the instance of a User, and used only one of its attributes (we printed the real name of the user). Most probably, when the variable was initialized, the ORM code made the following query to the database: “SELECT * FROM users WHERE id = 2
”, and then put all the values to the corresponding fields. Thus, we needlessly loaded the value of the user's login name, even though the code does not use it. Moreover, this is going to happen everywhere where the User class is instantiated, unless you take special care.
When we have more attributes, we load more data. This may or may not be a particular problem worth optimizing, but on a systemic level, I believe, this is a huge problem. My personal estimate, based on many years of careful study of ceilings in many offices where I was waiting for the queries to finish: at least one third of the data read by the database server and transferred to its clients over the network is immediately discarded without use.
I consider “SELECT *
” to be one of the most problematic ideas in SQL. Asterisk means “give me everything that this table contains at the moment, no matter what other columns you’ve added since then and which I don’t care about”. This may lead to unpleasant surprises.
Minimal Modeling
In Minimal Modeling terms, we have a User anchor, and this anchor has ‘loginname
’ and ‘real_name
’ attributes. But the anchor itself does not contain any data, it only keeps track of IDs. Attributes contain the data, they keep a correspondence between existing IDs and the value of the attribute. For example, we know that the real name of a user with ID=3 is “Alice D.”.
When we need to add one more attribute, the definition of User anchor does not change. When we remove an attribute (because it’s no longer needed, or not allowed), the definition of User anchor also does not change. When we change the physical storage of the attribute, again, the definition of User anchor does not change.
This is unlike relational modeling, where the users
table would change when we add another column, when we remove the column and when we move a column to a different table.
In the following post we’ll discuss what happens when we begin documenting users
table, class User
, and Minimal Modeling catalog. We’ll see how independence of attributes and anchors helps with the documentation.