Designing data access layer, pt. I

Let’s indulge in a bit of a carefree discussion of a design space of a modern data access layer.  Often this is colloquially known as ORM, but we don’t want to commit to both “object” and “relational” parts.

Many arthropods shed their exoskeletons to let the organism grow, in a process called molting.  I find this one of the most useful metaphors for systems design.  State of the art always advances, and older implementations, however flexible, often cannot keep up with some fundamental changes of the point of view. When we have time and resources we can grow some of our systems by implementing them anew.

So, let’s design a legacy-free data access layer toolkit, using some of the ideas that we’ve discussed before in this substack.

There are maybe a dozen of subcomponents handling lower- and upper-level responsibilities.  For example:

  • low-level drivers for a specific data storage server (e.g. MySQL, Cassandra, S3, HTTP-based APIs, etc.)

  • SQL read-only query generators (and other varieties of query language generators);

  • Structured data changing components (insert, update, delete);

  • Schema discovery and migration;

  • Etc., etc. 

We’re going to compile a complete list as the series progresses.  We’re going to extract and discuss well-defined areas of responsibility one by one; then we’ll be looking at what’s left and see how we can distill more structure.

Generalized LEFT JOIN

For today’s issue let’s discuss a generalized LEFT JOIN.  The task is quite simple:

  • We have a list of IDs of a certain noun (or entity, or anchor);

  • For each ID, we want to get the values of a number of attributes;

For example, we have a list of User ids: (2, 3, 5, 7); and we want to get their login_name, real_name, year_of_birth, and avatar as a JPEG blob.

Writing in a pseudo-Ruby, let’s begin with this:

user_ids = [2, 3, 5, 7]
             .join_attributes(:login_name, :real_name, :year_of_birth, :avatar).each do |row|

   p [row[:id], row[:login_name]]
   # => [2, “vasya”]

The language part: syntax, naming, chaining of functions, etc does not matter at this time: we do not design a “real” library, we’re trying to find out the useful requirements.  Also, many things that we find essential are hard to notice looking only at the example code.  For different classes of languages (strong-typed vs dynamic-typed, etc) we would be able to design some parts of our toolkit in different ways.  For now we’re interested in stuff that is true for any language.

Batch-first approach

First, our toolkit should only allow fetching data by the array of IDs, and never by a scalar ID.  If you want to fetch a single object you must use a single-element array.  Any convenience methods that circumvent that are not allowed, full stop.

Disallowing scalar ID access makes it a bit easier to prevent the so called N + 1 query problem (also called shotgun queries).  It happens when we traverse an array of results, and for each item we execute a separate SQL query).  Such queries are typically very lightweight, but because we need to do the full round-trip to the database, resulting latency usually becomes problematic.

So, we want to encourage the developer to think in batches from the very beginning.

Attributes depend on nouns, not the other way around

If we look at the definition of Anchor::User class, there will not be any attributes; neither in the code, nor in some runtime introspection.   The only thing that Anchor::User provides is the ID.

Also, there is probably no way to get “all possible attributes of a noun”: we’re going to discuss this later.

Each attribute is defined separately, in its own class. For example, we’re going to have classes like Attr::User::LoginName, Attr::User::RealName, etc.   Those classes depend on Anchor::User, and they define where each of the attributes is stored: e.g. database server driver, cluster, database, table, column.  For non-relational storages this could be e.g. S3 bucket and base URL.

We do that because of the concatenability principle.  We want to be able to easily remove (and add back again) any of the “features”, and that includes data attributes.  If a noun class definition includes attributes, then any code that depends on a noun class will also implicitly depend on one of the attributes.  We want all the code to explicitly depend on features that it uses.

Another reason is that not all attributes are created equal.  Some of the attributes are more sensitive than others: for example, those that contain personal information, or “heavy” attributes containing an avatar JPEG file.  So, we want to make sure that the code that uses sensitive attributes is explicitly aware of that fact and it runs with proper access controls, etc.

That’s also why we do not want typical OOP-based ORMs design. Nowhere in this system would you be able to use the typical “user.real_name”, due to the reasons explained above (there is no “real_name” attribute of a “User” class).

Restricted joins

The syntax that we use enforces a very restricted subset of a relational-like LEFT JOIN operator.  This allows us to provide many performance optimizations that will be transparent for the clients, such as:

  • Automatic handling of attributes that live in the same table;

  • JOIN-free SQL queries;

  • Parallel querying of independent storage systems;

  • Automatic sharding;

We discussed all those approaches in the “Many faces of left join” post.

Of course, other components of our toolkit would allow you to execute unrestricted JOIN operators, including other kinds of JOINs.  This will be discussed later.

Response latency

The code (Ruby block here, or lambdas, or closures, or callbacks, or whatever) that processes each result row should be called as early as possible, asynchronously.  This of course depends on the capabilities of our programming language, runtime environment, database server network protocol, database server capabilities and many other things.  But ideally, processing of the first row should begin while the rest of the dataset is still being processed.

Of course, we have to preserve this latency-first approach across our entire stack of components: from the database driver (and the database itself) to the highest-level abstractions.

Surely, we may want to have a “simple” function that returns the complete array with the results, but it must be a convenience wrapper over the real asynchronous thing.  And we must make the async version as convenient as possible for developers of any level of experience.

Let’s wrap up for now.  We’ve listed four requirements that seem to be the most important:

  • Batch-first approach;

  • Attributes depend on nouns;

  • Restricted joins;

  • Response latency.

“Designing data access layer” series will continue in a few weeks. Next week we’ll probably begin tackling the database normalization forms.