Let’s indulge in a bit of a carefree discussion of the design space of a modern data access layer. Colloquially this is often known as ORM, but we don’t want to commit to either the “object” or the “relational” part.
Many arthropods shed their exoskeletons to let the organism grow, in a process called molting. I find this one of the most useful metaphors for systems design. The state of the art always advances, and older implementations, however flexible, often cannot keep up with fundamental shifts in point of view. When we have the time and resources, we can grow some of our systems by implementing them anew.
So, let’s design a legacy-free data access layer toolkit, using some of the ideas that we’ve discussed before in this substack.
There are maybe a dozen subcomponents handling lower- and upper-level responsibilities. For example:
Low-level drivers for a specific data storage server (e.g. MySQL, Cassandra, S3, HTTP-based APIs, etc.);
SQL read-only query generators (and other varieties of query language generators);
Structured data changing components (insert, update, delete);
Schema discovery and migration;
Etc., etc.
We’re going to compile a complete list as the series progresses. We’re going to extract and discuss well-defined areas of responsibility one by one; then we’ll be looking at what’s left and see how we can distill more structure.
Generalized LEFT JOIN
For today’s issue let’s discuss a generalized LEFT JOIN. The task is quite simple:
We have a list of IDs of a certain noun (or entity, or anchor);
For each ID, we want to get the values of a number of attributes;
For example, we have a list of User IDs: (2, 3, 5, 7); and we want to get their login_name, real_name, year_of_birth, and avatar as a JPEG blob.
Writing in pseudo-Ruby, let’s begin with this:
user_ids = [2, 3, 5, 7]
Anchor::User.with_ids(user_ids)
  .join_attributes(:login_name, :real_name, :year_of_birth, :avatar).each do |row|
    p [row[:id], row[:login_name]]
    # => [2, "vasya"]
  end
The language part (syntax, naming, method chaining, etc.) does not matter at this point: we are not designing a “real” library, we’re trying to find the useful requirements. Also, many things that we find essential are hard to notice by looking only at the example code. For different classes of languages (statically typed vs. dynamically typed, etc.) we would be able to design some parts of our toolkit in different ways. For now we’re interested in what is true for any language.
Batch-first approach
First, our toolkit should only allow fetching data by an array of IDs, never by a scalar ID. If you want to fetch a single object, you must use a single-element array. Any convenience methods that circumvent this are not allowed, full stop.
Disallowing scalar ID access makes it easier to prevent the so-called N+1 query problem (also called shotgun queries). It happens when we traverse an array of results and execute a separate SQL query for each item. Such queries are typically very lightweight, but because each one needs a full round-trip to the database, the resulting latency usually becomes problematic.
So, we want to encourage the developer to think in batches from the very beginning.
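To make the contrast concrete, here is a rough sketch in the same pseudo-Ruby (using only the illustrative method names from the example above; nothing here is a committed API):

# The N+1 shape we want to rule out: one lightweight query per ID,
# each paying a full round-trip to the database.
user_ids.each do |id|
  Anchor::User.with_ids([id]).join_attributes(:login_name).each do |row|
    p row[:login_name]   # one extra round-trip per ID
  end
end

# The batch-first shape: a single request for the whole ID list.
Anchor::User.with_ids(user_ids)
  .join_attributes(:login_name).each do |row|
    p row[:login_name]
  end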
Attributes depend on nouns, not the other way around
If we look at the definition of the Anchor::User class, there will not be any attributes: neither in the code, nor in any runtime introspection. The only thing that Anchor::User provides is the ID.
Also, there is probably no way to get “all possible attributes of a noun”: we’re going to discuss this later.
Each attribute is defined separately, in its own class. For example, we’re going to have classes like Attr::User::LoginName, Attr::User::RealName, etc. Those classes depend on Anchor::User, and they define where each attribute is stored: e.g. database server driver, cluster, database, table, column. For non-relational storage this could be, e.g., an S3 bucket and a base URL.
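As a rough illustration (the class layout and the storage descriptor format here are purely hypothetical, not a committed design), the declarations could look something like this:

module Anchor
  class User
    # The anchor only knows how to address a User by its ID; no attributes here.
  end
end

module Attr
  module User
    class LoginName
      # Depends on the anchor and describes where the value lives.
      ANCHOR  = Anchor::User
      STORAGE = { driver: :mysql, cluster: :main, database: :app,
                  table: :users, column: :login_name }
    end

    class Avatar
      # A "heavy", non-relational attribute: stored outside the database.
      ANCHOR  = Anchor::User
      STORAGE = { driver: :s3, bucket: "user-avatars",
                  base_url: "https://example.com/avatars/" }
    end
  end
end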
We do that because of the concatenability principle. We want to be able to easily remove (and add back again) any of the “features”, and that includes data attributes. If a noun class definition includes attributes, then any code that depends on the noun class will also implicitly depend on every one of those attributes. We want all code to explicitly depend on the features it uses.
Another reason is that not all attributes are created equal. Some attributes are more sensitive than others: for example, those that contain personal information, or “heavy” attributes containing an avatar JPEG file. So, we want to make sure that code using sensitive attributes is explicitly aware of that fact and runs with proper access controls, etc.
That’s also why we do not want a typical OOP-based ORM design. Nowhere in this system would you be able to write the typical “user.real_name”, for the reasons explained above (there is no “real_name” attribute on a “User” class).
Restricted joins
The syntax that we use enforces a very restricted subset of a relational-like LEFT JOIN operator. This allows us to provide many performance optimizations that are transparent to the clients, such as:
Automatic handling of attributes that live in the same table (see the sketch below);
JOIN-free SQL queries;
Parallel querying of independent storage systems;
Automatic sharding.
We discussed all those approaches in the “Many faces of left join” post.
Of course, other components of our toolkit would allow you to execute unrestricted JOIN operators, including other kinds of JOINs. This will be discussed later.
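As an illustration of the first point (purely a sketch; real query generation is a topic for a later issue), assume that login_name, real_name and year_of_birth happen to live in the same users table, while avatar lives in S3. Then the join_attributes call from the example above could be executed roughly like this:

# One JOIN-free query covers every attribute stored in the users table:
#
#   SELECT id, login_name, real_name, year_of_birth
#   FROM users
#   WHERE id IN (2, 3, 5, 7)
#
# while the avatar attribute is fetched from S3 in parallel, as a batch of
# GET requests keyed by the same IDs.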
Response latency
The code (a Ruby block here, or lambdas, closures, callbacks, whatever) that processes each result row should be called as early as possible, asynchronously. This of course depends on the capabilities of our programming language, runtime environment, database server network protocol, database server capabilities and many other things. But ideally, processing of the first row should begin while the rest of the dataset is still being transferred.
Of course, we have to preserve this latency-first approach across our entire stack of components: from the database driver (and the database itself) to the highest-level abstractions.
Surely, we may want to have a “simple” function that returns the complete array of results, but it must be a convenience wrapper over the real asynchronous thing. And we must make the async version as convenient as possible for developers of any level of experience.
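A sketch of what that layering could look like (to_a and handle_row are hypothetical placeholders, not part of any committed API):

# The streaming form is the primary one: the block is invoked as rows arrive,
# possibly before the database has finished sending the whole result set.
Anchor::User.with_ids(user_ids)
  .join_attributes(:login_name).each do |row|
    handle_row(row)   # placeholder for client code
  end

# A "simple" eager form can be layered on top of the streaming one:
rows = Anchor::User.with_ids(user_ids)
  .join_attributes(:login_name).to_a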
Let’s wrap up for now. We’ve listed four requirements that seem to be the most important:
Batch-first approach;
Attributes depend on nouns;
Restricted joins;
Response latency.
The “Designing a data access layer” series will continue in a few weeks. Next, we’ll probably begin tackling database normalization forms.
You're mixing several concepts here that are somewhat related, but not in the way you present them.
The first one is the notion that a DAL (data access layer) is somewhat similar to an ORM. It's not. A DAL is an abstraction layer that encapsulates common database operations such as query execution, transaction handling, etc., to be used through a common interface by your upper layer. Your upper layer may be a thin approach (e.g. using Repositories or a Table Gateway), a Data Mapper (my favourite), an actual ORM, or even something else.
Your N+1 query problem seems to be non-existent in most code I see. The only place where this is common is figuring out window function sizes. Also, it is quite important to note that many DB client libraries actually (sort of) stream the results, so you may not receive the full dataset you selected at once, but row by row. This is also commonly used with cursors, which have been around since the early '90s. It is quite important to be aware of when NOT to abstract this away, so you won't have to process the whole dataset to gather the information you need. Again, proper ORMs provide hydration mechanisms that leverage all these little nuances without a developer having to worry about them.
The way you propose to use attributes is - in my opinion - atrocious. It mangles what could be simple code completion into something to be managed in three different places: your database schema, your attribute map, and your actual data container (either object or dictionary). Keeping them in sync for a medium-sized application is a nightmare. I understand your enthusiasm for the gained capability of having an Entity (in this case, a user) that will have data stored in multiple places. However, this is an architectural mistake, and - to be blunt - piss-poor design. In some scripting languages, this will also have the downside of requiring you to initialize ALL of the dependencies (including connections to external systems) even if you're just checking whether the user is active or not, so lazy loading is not possible.
Following on from this, in a proper multi-layer application you would have both a notion of Entity - which in its essence may be just a database row - and a set of services that perform operations on entities. A good strategy to follow for midsize applications is domain:operation:who, e.g. storageService->getUserAvatar(currentUser). You may, of course, encapsulate this further into services that compose a set of operations for a user, such as sending email, updating the avatar, etc.
There are a couple of common architectural patterns I'd recommend you read up on, such as onion and hexagonal architectures. They should not be considered gospel, but may guide you toward better domain separation in your application design.
Somewhere in the article you also seem to hint at dynamic attributes, a bit "à la" JSON. There are several ways of doing this, but my usual approach is to keep them separate from the proper schema implementation, and you will *not* want to have attributes such as user.name as a DB field and user.favourite_color as an automagically mapped dynamic field, for both performance and maintainability reasons. There are a couple of common patterns/anti-patterns for this, and there is no magic bullet. I've implemented everything from completely dynamic EAV solutions to an HSTORE "attributes" field on each relevant table, with some attribute naming in constants somewhere. Each problem requires a different solution, depending on the usage of the field. If the field is to be filterable (either indexed, or a facet), sometimes it pays off to just handle it "manually" with a caching table.
Then you mention async. I won't get into details on why you're probably using async without any advantage (and probably with several disadvantages), but the main problem is that you mix up the concept of async code with network streams. Those are two completely different (but often used together) concepts. I suggest you dig into the reactor pattern for a quick but explicit example of one that is not the other. It is also worth noting that closures or lambdas aren't necessarily async. In fact, in most languages they aren't - at all.