You're mixing several concepts here that are somewhat related, but not in the way you present them.

The first one is the notion that a DAL (data access layer) is somewhat similar to an ORM. It's not. A DAL is an abstraction layer that encapsulates common database operations such as query execution, transaction handling, etc., to be used through a common interface by your upper layer. Your upper layer may be a thin approach (eg. using Repositories or a Table Gateway), a Data Mapper (my favourite), an actual ORM or even something else.
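
To make the distinction concrete, here is a minimal sketch assuming Python and a DB-API style connection (the names DataAccessLayer and UserRepository are illustrative, not from any particular library): the DAL only knows how to run queries and handle transactions, while the thin upper layer translates domain questions into DAL calls.

```python
# Minimal sketch: the DAL encapsulates query execution and transactions;
# the repository on top of it is the "thin approach" upper layer.
class DataAccessLayer:
    def __init__(self, connection):
        self._conn = connection              # any DB-API 2.0 connection

    def query(self, sql, params=()):
        cur = self._conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()

    def execute(self, sql, params=()):
        with self._conn:                     # commit/rollback handled here
            self._conn.cursor().execute(sql, params)


class UserRepository:
    """Upper layer: knows about users, not about cursors or transactions."""
    def __init__(self, dal):
        self._dal = dal

    def find_active(self):
        return self._dal.query("SELECT id, name FROM users WHERE active = %s", (True,))

    def deactivate(self, user_id):
        self._dal.execute("UPDATE users SET active = FALSE WHERE id = %s", (user_id,))
```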

Your n+1 query problem seems to be non-existent in most code I see. The only place where this is common is when figuring out window function sizes. Also, it is quite important to notice that many DB client libraries actually (sort of) stream the results, so you may not receive the full dataset you selected at once, but row by row. This is also commonly used with cursors, which have been in use since the early 90s. It is quite important to be aware of when NOT to abstract this away, so you won't have to process the whole dataset to gather the information you need. Again, proper ORMs provide hydration mechanisms that leverage all these little nuances without a developer having to worry about them.
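
As a small illustration of that streaming behaviour, assuming Python with psycopg2 against PostgreSQL (the DSN and the process_row handler are hypothetical): a named cursor is server-side, so rows arrive in batches as you iterate instead of the whole result set being materialised in memory.

```python
import psycopg2

def process_row(user_id, name, status):
    ...  # hypothetical per-row handler

conn = psycopg2.connect("dbname=app")        # hypothetical DSN
# A named cursor is a server-side cursor: rows are streamed in batches
# while iterating, instead of the whole result set being fetched at once.
with conn.cursor(name="user_scan") as cur:
    cur.itersize = 1000                      # rows per network round trip
    cur.execute("SELECT id, name, status FROM users")
    for user_id, name, status in cur:
        process_row(user_id, name, status)
```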

The way you propose to use attributes is - in my opinion - atrocious. It mangles what could be simple code completion into something to be managed in 3 different places - your database schema, your attribute map, and your actual data container (either object or dictionary). Keeping them in sync for a medium application is a nightmare. I understand your enthusiasm for the gained capability of having an Entity (in this case, a user) that will have data stored in multiple places. However, this is an architectural mistake, and - to be blunt - piss-poor design. In some scripted languages, this will also have the downside of requiring you to initialize ALL of the dependencies (including connections to external systems) even if you're just checking whether the user is active or not, so lazy loading is not possible.

Following on this, in a proper multi-layer application you would have both a notion of Entity - which in its essence may be just a database row - and a set of services that perform operations on entities. A good strategy for midsize applications is domain:operation:who, ex: storageService->getUserAvatar(currentUser). You may, of course, encapsulate this further into services that compose a set of operations for a user, such as sending email, updating the avatar, etc.
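
A sketch of what that naming convention might look like in Python (the blob_store and mailer collaborators are hypothetical stand-ins for whatever backends you inject):

```python
class StorageService:
    def __init__(self, blob_store):
        self._blobs = blob_store                         # injected backend client

    # domain: storage, operation: get/put avatar, who: the given user
    def get_user_avatar(self, user):
        return self._blobs.get(f"avatars/{user.id}")

    def put_user_avatar(self, user, image_bytes):
        self._blobs.put(f"avatars/{user.id}", image_bytes)


class UserService:
    """Composes a set of operations on the User entity."""
    def __init__(self, storage_service, mailer):
        self._storage = storage_service
        self._mailer = mailer

    def update_avatar(self, user, image_bytes):
        self._storage.put_user_avatar(user, image_bytes)
        self._mailer.send(user.email, "Your avatar was updated")
```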

There are a couple of common architectural patterns I'd recommend you read into, such as onion and hexagonal architectures. They should not be considered gospel, but they may guide you towards better domain separation in your application design.

Somewhere in the article you also seem to hint at dynamic attributes, a bit "a-la" JSON. There are several ways of doing this, but my usual approach is to keep them separate from the proper schema implementation, and you will *not* want to have attributes such as user.name as a db field and user.favourite_color as an automagically mapped dynamic field, both for performance and maintainability reasons. There are a couple of common patterns/anti-patterns for this, and there is no magic bullet. I've implemented everything from completely dynamic EAV solutions to having an HSTORE "attributes" field on each relevant table, with some attribute naming in some constants somewhere. Each problem requires a different solution, depending on the usage of the field. If the field is to be filterable (either indexed, or a facet), sometimes it pays off to just handle it "manually" with a caching table.
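
As a rough sketch of the "keep them separate" approach (assuming PostgreSQL with a JSONB column driven from Python; the column and constant names are made up for the example): first-class fields stay as real columns, dynamic extras live in one attributes column, and the allowed attribute names sit in constants rather than being automagically mapped.

```python
import json

# Hypothetical dynamic attribute names, kept in constants somewhere.
ATTR_FAVOURITE_COLOR = "favourite_color"

CREATE_USERS = """
CREATE TABLE users (
    id          uuid  PRIMARY KEY,
    name        text  NOT NULL,                        -- first-class, filterable
    attributes  jsonb NOT NULL DEFAULT '{}'::jsonb     -- dynamic extras only
)
"""

def set_attribute(cur, user_id, attr, value):
    # The || operator merges the new key into the existing attributes.
    cur.execute(
        "UPDATE users SET attributes = attributes || %s::jsonb WHERE id = %s",
        (json.dumps({attr: value}), user_id),
    )

def get_attribute(cur, user_id, attr):
    cur.execute("SELECT attributes ->> %s FROM users WHERE id = %s", (attr, user_id))
    row = cur.fetchone()
    return row[0] if row else None
```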

Then you mention async. I won't get into details on why you're probably using async without any advantage (and probably with several disadvantages), but the main problem is that you mix the concept of async code with network streams. Those are two completely different (but often used together) concepts. I suggest you dig into the reactor pattern for a quick but explicit example of one that is not the other. It is also worth noting that closures or lambdas aren't necessarily async. In fact, in most languages they aren't - at all.


hi,

sorry it took me quite some time to read 700 words and highlight the parts that I wanted to respond to. Thank you for the comment.

> Your n+1 query problem seems to be non-existent in most code I see.

In my environment this is sometimes a big problem, and people are often reminded that it exists and that you should strive to prevent it while the traffic is small. On the positive side, this problem can be a quick win, a low-hanging fruit, because fixing it is perfectly straightforward.
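
For reference, the kind of straightforward fix I have in mind is collapsing the per-row lookups into a single "WHERE id IN"-style query; a rough sketch (query and query_one are hypothetical helpers, PostgreSQL flavour):

```python
# N+1: one query for the posts, then one query per post for its author.
posts = query("SELECT id, author_id, title FROM posts ORDER BY id LIMIT 50")
for post in posts:
    author = query_one("SELECT name FROM users WHERE id = %s", (post["author_id"],))

# Fixed: two queries total, regardless of how many posts were fetched.
author_ids = sorted({post["author_id"] for post in posts})
authors = query("SELECT id, name FROM users WHERE id = ANY(%s)", (author_ids,))
authors_by_id = {row["id"]: row for row in authors}
for post in posts:
    author = authors_by_id[post["author_id"]]
```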

> The way you propose to use attributes is - in my opinion - atrocious. It mangles what could be simple code completion into something to be managed in 3 different places - your database schema, your attribute map, and your actual data container (either object or dictionary).

Code completion is a good thing, but it loses to the runtime benefits (if they exist). Also, restoring the ability to do code completion under this system could be a separate engineering exercise.

I consider this complexity essential, because there are actually three things, just as you say. So, there should be three corresponding things somewhere in the code base.

> an Entity (in this case, a user) that will have data stored in multiple places. However, this is an architectural mistake

I'm not sure what you call "an architectural mistake". Is it "storing data in multiple places"? It seems that storing different attributes of the same noun in vastly different places (database servers, databases, hosts, shards, etc.) is basically the default in many systems.

> piss-poor design

> atrocious

> nightmare

I believe that as senior developers we should avoid using these kinds of words in discussions of design decisions. You can psychologically imprint somebody, and it could distort their understanding of the technology being discussed.

A common truism is that every design decision has different trade-offs. Many of the decisions that led to the NoSQL revolution could certainly be described as "atrocious" and "piss-poor design" from the point of view of traditional relational model, yet here we are. Let's not put on blinkers wrt what could be discovered by traversal of design space.

> a notion of Entity - which in its essence may be just a database row

On the contrary: as much as I'm trying to investigate the "just a database row" view, I'm very much interested in the idea that a database row is completely the wrong abstraction level for modeling entities.

> A good strategy for midsize applications is domain:operation:who, ex: storageService->getUserAvatar(currentUser)

I'm not sure what "good" and "midsize" mean here exactly. What changes when the applications are tiny and huge? What would be the "bad" strategy, and how does your measurement device work?

> I'd recommend you read into, such as onion and hexagonal architectures.

I don't know about the onion, but afaik hexagonal architecture is a package of acceptable principles. This may be good, especially as a kind of rule of thumb, but at this phase of the substack narrative we want to disassemble all kinds of packaged offers to see how they work.

> separate from the proper schema implementation

Or you could accept the hypothesis that JSON-based schemas (and also key-value or EAV schemas) are just as good as the classic relational schemas.

> user.favourite_color as an automagically mapped dynamic field, both for performance and maintainability reasons.

I'm not sure I understand. In the beginning you complained that this approach is:

> something to be managed in 3 different places

so there is no automagic at all (and that is one of the goals).

> mix the concept of async code with network streams.

Of course you have to mix different concepts to achieve the desired result: reducing the time to first byte, and utilising the parallelism between server and client. It requires cooperation from the entire state, so there is nothing wrong with talking about relevant parts simultaneously.


(see part I first)

Part II

> a notion of Entity - which in its essence may be just a database row

It may or may not be just a database row. In the case of a user, it almost always is, even if you have supporting information such as auth tokens, avatars or user-related messages. The User is an entity, an Avatar is a private property of the User entity, and User Messages are often a couple of entities (folder & messages) managed via a service.

> I'm very much interested in the idea that a database row is completely the wrong abstraction level for modeling entities.

Call it a domain set. See the paragraph above. Nothing changes.

> I'm not sure what "good" and "midsize" mean here exactly

It often depends on the complexity of the business domain and - as you point out - it's subjective. I'd consider small something with fewer than 10 entities or services (or scoped domains - eg. imagine a graphics transformation library; 10 distinct transformation types is small, even if it requires a non-trivial amount of code); midsize something between 10 and 250, and large something above 250.

> What changes when the applications are tiny and huge?

When applications are tiny, you often don't need to correctly identify and scope your domains, as it is faster to just "throw it together". This should, however, be done only in concept code or well-defined applications, not in something that may evolve as a product. When they are huge, you usually need to abstract further. The way you do this depends on the specifics of your application, and on the dependency graph of its internal parts. A well designed modular application will avoid transitive coupling between modules; this facilitates the process of extracting modules into dedicated services, or encapsulating them in domain areas to be abstracted.

> This may be good, especially as a kind of rule of thumb, but at this phase of the substack narrative we want to disassemble all kinds of packaged offers to see how they work.

That is the problem. It should be done the other way around: you first design the big picture, then you scope and design the specifics. When you start evaluating tiles for a bathroom, you should have a concrete idea of what size the bathroom will be, where it will be located and how it sits in the overall design of the house (is it near where it's needed?). If you're just going to build a house based on opinions about surface finishing, you're going to have a bad time - and it gets worse if you think this is foundational. Foundations are revised taking into account the overall architectural design and where the pipes are placed.

Existing architectural patterns are good blueprints because they already provide ways of solving many of the obvious design defects. In my opinion, if your overall strategy is to dissect different ways data access layers are implemented, skipping both the most common strategies (with their details and tradeoffs) and the architectural patterns that come with them doesn't seem a fitting approach.

> Or you could accept the hypothesis that JSON-based schemas (and also key-value or EAV schemas) are just as good as the classic relational schemas.

They aren't. JSON could be, but the lack of proper integers is a serious problem in many usage scenarios (I could get into this, but it's not related to the overall discussion, so if you're interested in those shortcomings drop me an email). The lack of a proper date format & validation, especially with timezones, is a nuisance and comes with performance penalties when implemented. When using the object attribute notation, it is about as verbose as XML, but XML/XSLT offers a bunch of other features. Key-value stores are good for some specific stuff, and you can basically build a relational store on top of a key-value store - either because it is the product at hand or you need it, or because you're just reinventing the wheel again, in hopes of a better wheel. EAV is often an atrocity, as you sacrifice both relational consistency and performance, but it does have its use-cases. This is probably a scenario where non-relational databases (such as NoSQL/NewSQL) may make sense, depending on whether the attributes stored in EAV are first-class attributes or just dynamic properties, and/or whether you'll be filtering results by attribute values.
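
To make the EAV trade-off concrete, a minimal sketch of what such a table tends to look like (hypothetical names, PostgreSQL flavour): every value is squeezed into text with no real constraints, and rebuilding "one object" means re-pivoting rows back into a map, which is where the consistency and performance costs come from.

```python
CREATE_EAV = """
CREATE TABLE user_attributes (
    user_id    uuid NOT NULL REFERENCES users(id),
    attribute  text NOT NULL,
    value      text NOT NULL,         -- no real types, no real constraints
    PRIMARY KEY (user_id, attribute)
)
"""

def load_attributes(cur, user_id):
    # Reading N attributes means N rows (or N joins) per entity.
    cur.execute(
        "SELECT attribute, value FROM user_attributes WHERE user_id = %s",
        (user_id,),
    )
    return dict(cur.fetchall())
```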

> I'm not sure I understand

You stated: "We want to be able to easily remove (and add back again) any of the “features”, and that includes data attributes.". You're either referring to dynamic attributes, or to plain truncation (which is bad, very bad). I assumed you meant dynamic attributes.

> reducing the time to first byte, and utilising the parallelism between server and client.

Hum, not necessarily and not necessarily, and that's why I pointed it out. First, async calls aren't faster - more often than not, they're actually slower, and they can create huge implementation problems when used in contexts where developers don't understand how concurrency works. Also, you're confusing non-blocking network calls with concurrency. While async implementations will often provide non-blocking network calls, it doesn't mean that the code will run in parallel; in fact, in some implementations, it won't - at all. You can also implement non-blocking code that doesn't run in parallel - this can be done by performing a fcntl() or ioctl() on the socket descriptor, and then polling for the result.
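
A small sketch of exactly that last point, in Python (single thread, no async framework; example.org is just a placeholder host): the socket is put into non-blocking mode and polled, so nothing ever blocks on recv(), yet nothing runs in parallel either.

```python
import select
import socket

sock = socket.create_connection(("example.org", 80))
sock.sendall(b"GET / HTTP/1.0\r\nHost: example.org\r\n\r\n")
sock.setblocking(False)           # same spirit as fcntl()/ioctl() with O_NONBLOCK

chunks = []
while True:
    # Poll the descriptor for readability instead of blocking on it.
    readable, _, _ = select.select([sock], [], [], 0.1)
    if readable:
        data = sock.recv(4096)
        if not data:              # peer closed the connection
            break
        chunks.append(data)
    # else: this single thread is free to do other work between polls

print(len(b"".join(chunks)), "bytes received")
```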

The funny thing is, while there are scenarios where async is a good fit, in most cases it just adds complexity. Take my initial example - a user with an avatar. When you implement an endpoint, building the avatar url will depend on a read of the user table; there is nothing parallelizable there. You won't get faster code by using async, as the result of the API depends on the read of the table; in fact, it will probably run slower, due to the added complexity of managing both the connection state and the application flow.

> It requires cooperation from the entire state, so there is nothing wrong with talking about relevant parts simultaneously.

There is. Imagine you need to fetch a user record from the database, as well as from another system (let's say a billing API). You pass the user id, and you're working on middleware without explicit user authentication. Because it's an important middleware, you just use async because "it scales better" - you fetch user info from the db, and the stuff from the billing API in parallel. Yaay performance, right? Wrong. You just created a DoS mechanism for your billing subsystem, because you aren't checking whether the user exists. But hey, it's something internal, right? 6 months from now, some contractor will use your middleware to get some info to a 3rd party system that has a bunch of users that may or may not match yours. Because it's event-driven, every time something is created on their side, you receive an API request on the middleware, with a probably invalid user. This passes tests, as valid users are mapped, and it works, but 3 months after implementation you see your billing system being hammered and you wonder why you can't bill customers. This is why design is so important, and this is why I'm not fond of presenting "solutions" for "simple" problems. They often aren't simple - at all.
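
A minimal asyncio sketch of that failure mode, with hypothetical db and billing clients: the "parallel" version has already fired the billing request before it knows whether the user exists, while the sequential version costs nothing extra because the dependency is sequential anyway.

```python
import asyncio

async def handle_event_parallel(user_id, db, billing):
    # Both requests are started together, so the billing API gets hit
    # even when user_id does not exist on our side.
    user, invoices = await asyncio.gather(
        db.fetch_user(user_id),            # may return None
        billing.fetch_invoices(user_id),   # already sent, valid id or not
    )
    return user, invoices

async def handle_event_checked(user_id, db, billing):
    user = await db.fetch_user(user_id)
    if user is None:
        return None                        # the billing API is never touched
    return user, await billing.fetch_invoices(user_id)
```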


(sorry for the long response, this one is in two parts)

Part I

> In my environment this is sometimes a big problem

Not everyone's experience is equal, and that is part of what makes sharing approaches and solutions interesting. However, if you have parts of code where you should be using "WHERE id IN" (or something vaguely similar), it seems you have a "stack smell" - whoever is producing the code is opting for advanced usage (such as ditching typical ORM/ORM-like solutions) without the proper skill, when they should probably be using higher-level abstractions instead. I'm aware that sometimes it's not easy - either because the existing ORM is garbage, or because the codebase is already built that way. This is often an opportunity to reimplement the base data access with a fresh approach, and it should be sold to management as such: an opportunity to cut costs in the long term, not yet another refactor.

> Code completion is a good thing, but it loses to the runtime benefits (if they exist)

As you have stated explicitly, you aren't targeting a specific language. Whatever gains you have from the runtime aren't an advantage in your proposal, as you don't know if they exist at all. Off the top of my head, and using the obvious implementation path, I see no advantage in this - only several disadvantages.

> Also, restoring the ability to do code completion under this system could be a separate engineering exercise

Or downright impossible, depending on the language. On a project, I'd say spending time patching the IDE is considered a waste of resources, especially if there are alternatives you can use. Please note, this is a design decision, and as such it comes with a cost. Can you make a design that fits the same purpose with the same goal without relying on runtime evaluation? From my experience, I'd say yes.

> I consider this complexity essential, because there are actually three things, just as you say. So, there should be three corresponding things somewhere in the code base.

Well, no, it shouldn't. At some point you may be reading entity data from 3 different sources, but you often don't read that data at every access. Your data model should not be tied to the specifics of the backend used. More on this below.

> an Entity (in this case, a user) that will have data stored in multiple places.

This is a misquote. I was referring to the fact that you are explicitly designing your entity attributes to perform on specific backends. That is usually an architectural blunder, as you want to keep your entity definitions and the specifics of the underlying storage mapped as independently as possible, without requiring explicit entity mapping. This is where OOP patterns shine, because you can take a "User" entity and add specific logic to retrieve an avatar from S3.

> Is it "storing data in multiple places"? It seems that storing different attributes of the same noun in vastly different places (database servers, databases, hosts, shards, etc.) is basically the default in many systems.

No, it's mapping specific attributes identifying multiple backends or data storage scopes. As an example, you have a User entity with a bunch of attributes and an id that is - for the sake of modelling - a direct uuid on an SQL system, and an avatar on S3 also mapped into the user table as a UUID field* (this is not a stellar example of segregation of concerns, but let's use it for simplicity's sake). Most user operations aren't related to fetching or updating the avatar; instead, most user operations will be reading other stuff such as the name and status, validating the id, checking passwords and authentication tokens, etc. Also, there is a transitive dependency - you can only access the S3 data successfully if you read the user table (for a given user, you will need to determine the image name/id first). However, you can access whatever is in the user table - even the image id (so you can eg. build a url for your authenticated CDN) - without needing to hop to S3.

This makes it evident that there are clearly "attributes" and "data", even if data is accessed through attributes. Your proposal mixes the two concepts AND the backend logic each one uses, with no advantage whatsoever and several disadvantages, stated in my previous comment. By using a layered approach (often paired with DI), you can just have User::getId, User::getName, User::getAvatarUrl(), User::getAvatar(), User::updateAvatar() or some similar implementation somewhere in your code. There is no need to specify that S3 reading requires a Key field; it is a transitive dependency of the user table. With the same approach, any S3-touching code should explicitly access the S3 client instance via dependency injection (with or without lazy loading, but in some languages this is hugely beneficial). You can even have locally (context) scoped S3 client instances, which would allow access to specific buckets for specific customers - which is great for managing both resource usage and compliance (eg. data removal). There is a compact sketch of this shape after the footnote below.

* Why not use the available User id? Because by using a separate field, we can refresh it when needed, eg. when uploading a new avatar, and by leveraging ETags you get automatic cache & cache invalidation taken care of, by using the random uuid as the ETag. Why not just use access policies on the S3 bucket? Because unless you're actually using one of AWS's authentication solutions, you'll usually be performing a read grant for an IP on potentially personal data (eg. a guy using his photo as an avatar). While somewhat common, it is sloppy design, and quite frankly, whoever made that decision should probably rethink their role in the company.
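
Here is that layered shape sketched in Python, with the S3 client injected (the names AvatarStorage, UserService and get_avatar_url are illustrative, not anything from the article, and the s3_client API is a hypothetical stand-in): building the URL needs only the user row, while fetching the actual bytes goes through the injected storage.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    id: uuid.UUID
    name: str
    avatar_id: Optional[uuid.UUID]           # plain column on the users table

class AvatarStorage:
    """Only this class knows that avatars live in S3."""
    def __init__(self, s3_client, bucket):
        self._s3 = s3_client                  # injected (possibly lazily)
        self._bucket = bucket

    def fetch(self, avatar_id):
        return self._s3.get_object(self._bucket, str(avatar_id))  # hypothetical client API

class UserService:
    def __init__(self, avatar_storage, cdn_base):
        self._avatars = avatar_storage
        self._cdn_base = cdn_base

    def get_avatar_url(self, user):
        # Needs only the user row: no S3 round trip to build a URL.
        return f"{self._cdn_base}/{user.avatar_id}" if user.avatar_id else None

    def get_avatar(self, user):
        # The transitive dependency: S3 access requires the row's avatar_id.
        return self._avatars.fetch(user.avatar_id) if user.avatar_id else None
```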

> I believe that as senior developers we should avoid using these kinds of words in discussions of design decisions. You can psychologically imprint somebody, and it could distort their understanding of the technology being discussed.

I'm off the hook, as I'm no senior developer :) I know I'm not everybody's cup of tea, and I agree with you that most design decisions are tradeoffs. But let's face it: while there is usually no "right", "one-size-fits-all" way of doing things, there are plenty that are plain wrong, obviously wrong, that will cause serious harm down the line. I couldn't care less if a developer gets traumatized because he implemented a shitty solution; I'm not there to cheer him up, I'm usually there to make him grow as a developer and evaluate his work in the process. And growing can be painful, as we all know. I'm usually not an asshole, and mistakes, errors, blunders and miscalculations happen all the time (some of mine are usually quite spectacular) and are a part of business. Sometimes an idea that seemed good just didn't pan out when implemented. If someone tried to re-invent the wheel or just plainly skip the architectural design, either that person made a good decision or didn't. More often than not, he didn't. And if he's senior enough, the proposal should be on par with his experience, even if it's not a good decision or a good fit for the specific case. If it's not on par with his experience, ofc I will make him aware that I think his design is piss-poor, but more importantly - why. Sometimes they go back to the drawing board and surprise you. Sometimes they don't surprise you - at all.

I still write code often, but my role is usually "big picture" stuff, such as application design and framework design, so it's more of an architectural role (often) with management functions. So poking holes in things and conceiving parts that fit together with the overall application design without major downsides takes a non-trivial amount of my time. Sometimes I also implement them (or at least the skeleton), as I believe a properly scoped implementation is almost as good as good documentation, and takes way less time to write.

> Many of the decisions that led to the NoSQL revolution could certainly be described as "atrocious" and "piss-poor design" from the point of view of traditional relational model, yet here we are.

Yes, here we are. Solar roadways, humans on Mars, (most) NoSQL databases. Just because a bad idea has a marketing department doesn't mean it's not bad. And just because it is easy to use, it is still often an atrocious, piss-poor excuse for a database. NoSQL adoption was driven by the fact that it was promoted as a "high-performance database without a learning curve". So basically you can store stuff without learning a new language (SQL). No, instead you use functions. Except when you actually need something non-trivial; in that case, you're mostly on your own writing queries in your language of choice. It's a variation of the "get rich quick without working" scheme, where in the end you actually worked, you're not rich and you still have to pay. They have their niches, their use cases, but no one in their right mind would use a NoSQL database in a scenario where a traditional RDBMS is a viable option - now or 10 years ago. And I'm not saying this lightly - almost 10 years ago, I was called in to evaluate a very well-known NoSQL database (probably the first name that pops into your head) for a product (used on 4 continents, so high profile), and the guy they brought in as their technical manager or something like that didn't even know how the database worked internally (this was relevant for querying bigger-than-memory datasets, one of the benchmarks I performed on the database, based on our workload needs and their own pitch of its capabilities). They basically just went with the typical marketing script, while making fun of both relational databases and SQL - even after I'd shown them that PostgreSQL wiped the floor with their toy in every scenario we needed.
