Organizational scalability and flexible database schemas (pt. II)
This April I’m going to organize a Zoom course on how to make sense of typical databases.
In the previous post we started discussing flexible database schemas, also sometimes called “schemaless”. As an example, we looked at modeling many boolean attributes of a single entity: say, 5 to 20 attributes of a restaurant, maybe more. We covered the most straightforward and naive approach (put all attributes in the main table), mentioned that we can also use a side table, and mentioned the approach that naturally comes to mind: a “schemaless” table that can store all kinds of different boolean attributes without any database schema changes.
Single-attribute tables
In the “Comprehensive guide to storing boolean values” we discussed that you can also store boolean attributes in a single-attribute table. Let’s try this approach here.
Let’s begin with the same first five attributes: has_delivery; is_pet_friendly; has_play_area; is_vegan_only; is_halal. Each attribute goes in a separate table, so we have to create five nearly identical tables:
CREATE TABLE restaurant__has_delivery
(
id INTEGER NOT NULL PRIMARY KEY
);
CREATE TABLE restaurant__is_pet_friendly
(
id INTEGER NOT NULL PRIMARY KEY
);
CREATE TABLE restaurant__has_play_area
(
id INTEGER NOT NULL PRIMARY KEY
);
CREATE TABLE restaurant__is_vegan_only
(
id INTEGER NOT NULL PRIMARY KEY
);
CREATE TABLE restaurant__is_halal
(
id INTEGER NOT NULL PRIMARY KEY
);
The tables differ only in the second part of their name (the attribute name). If a restaurant has the attribute, the restaurant’s ID is inserted as a row into the corresponding table; if we need to set the attribute back to false, we delete that row.
Now we need to add fifteen more attributes. That means that we’re going to add fifteen more such tables. How do we feel about this solution?
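Before we compare the approaches, here is a minimal sketch of how reads and writes look with single-attribute tables, using Python’s sqlite3 with an in-memory database (the table name follows the article; the restaurant id 42 is made up for illustration):

```python
import sqlite3

# In-memory database; the table name follows the article's naming scheme.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant__has_delivery (id INTEGER NOT NULL PRIMARY KEY)")

# Setting the attribute to true: insert the restaurant's id as a row.
conn.execute("INSERT INTO restaurant__has_delivery (id) VALUES (42)")

# Reading the attribute: check whether the row exists.
row = conn.execute("SELECT 1 FROM restaurant__has_delivery WHERE id = 42").fetchone()
print(row is not None)  # True

# Setting the attribute back to false: delete the row.
conn.execute("DELETE FROM restaurant__has_delivery WHERE id = 42")
row = conn.execute("SELECT 1 FROM restaurant__has_delivery WHERE id = 42").fetchone()
print(row is not None)  # False
```

Note that with this design “false” and “absent” are the same thing: the table stores only the ids for which the attribute is true.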
Three approaches are roughly equivalent
First, we can see that all three approaches (the field-per-attribute table, the flexible table, and a set of single-attribute tables) can be freely “converted” into each other using views (virtual tables defined by an SQL query).
For example, if we have a field-per-attribute table, we can “convert” it to a flexible table like this:
CREATE VIEW v_restaurant_attributes AS
SELECT restaurant_id, 1 AS attribute_id, has_delivery AS value
FROM restaurants
UNION ALL
SELECT restaurant_id, 2 AS attribute_id, is_pet_friendly AS value
FROM restaurants
UNION ALL
SELECT restaurant_id, 3 AS attribute_id, has_play_area AS value
FROM restaurants
-- … etc.
Here “1”, “2”, and “3” are the numeric ids of the attributes: basically, they’re hardcoded somewhere in your codebase (or in a lookup table).
We can convert the field-per-attribute table to single-attribute tables with a number of views:
CREATE VIEW v_restaurant__has_delivery AS
SELECT restaurant_id
FROM restaurants
WHERE has_delivery = 1;

CREATE VIEW v_restaurant__is_pet_friendly AS
SELECT restaurant_id
FROM restaurants
WHERE is_pet_friendly = 1;
-- … etc.
The other four conversion directions (six in total) are left as an exercise for the reader. Note that these views could be created and updated automatically by a script, as long as your physical tables have a uniform structure.
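Such a view-generating script can be very small. Here is a sketch of one that produces the single-attribute views from a list of attribute column names, assuming the uniform structure described above (a `restaurants` table with one integer column per boolean attribute):

```python
# Attribute column names from the article; in practice this list would
# come from your codebase or a lookup table.
ATTRIBUTES = [
    "has_delivery",
    "is_pet_friendly",
    "has_play_area",
    "is_vegan_only",
    "is_halal",
]

def view_ddl(attr: str) -> str:
    """Build the CREATE VIEW statement for one attribute column."""
    return (
        f"CREATE VIEW v_restaurant__{attr} AS "
        f"SELECT restaurant_id FROM restaurants WHERE {attr} = 1;"
    )

for attr in ATTRIBUTES:
    print(view_ddl(attr))
```

When a new attribute is added, re-running the script regenerates the full set of views, so the three representations stay in sync.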
Organizational scalability of three approaches
Later, in a separate issue, we’ll talk about the physical representation of the three approaches and discuss how they behave under common operations: queries, adding new attributes, removing unused attributes, and so on.
But for now let’s talk about why people seem to love flexible database schemas.
My theory is that flexible schemas let DBAs (database administrators) avoid being the schema bottleneck.
Traditionally, DBAs were the guardians of table schemas. If you needed to make any schema change (create a new table, add or remove a field), you had to go to the DBAs and ask them to make it. They had the necessary permissions (which you often did not have). They could sanity-check your request and suggest a different way. They held many competing concerns in mind: disk space, learned best practices, knowledge of how the database engine works, knowledge of how typical queries behave, and so on. Also, because the data was often somewhat sensitive (sales data, for example), they were responsible for making sure that access to the data was controlled.
Recently (maybe in the past decade) this started to change. The idea of DevOps was introduced, and now more people are expected to manage the database schema on their own: you can create new tables and add new fields more or less freely. The role of the DBA still exists, but their responsibilities have shifted. They still have a strong influence, though: if you design your tables in a way that creates problems for the system, they will find you and they will talk to you, suggesting ways to improve your approach to table design.
So imagine that you have to run every table change through the DBAs for approval, as we used to. Basically, the DBAs play The Game of Tables in real time, and if their move does not match your move, they become concerned.
Suppose that you already have five attributes, and you need to add another one. Here is how it would go in three approaches:
Field-per-attribute: you ask to add yet another field to the existing table. The table is already quite wide: it has five attribute fields and some other data too. Do you want an index with that? At some point the DBAs will certainly be concerned: nobody wants to work with 100-field tables. You will begin receiving push-back.
Table-per-attribute: you ask to create yet another trivial table. The table has a single field, which is unusual, so their DBA senses tingle. They become concerned about the number of JOINs. They get bored approving such trivial changes; it feels like too much toil. You will begin receiving push-back.
Flexible schema: you don’t need to contact the DBAs at all. You do everything yourself: you allocate another id for the new attribute, and you can immediately start writing and reading your data. The performance characteristics of this design may not be ideal, but they’re predictable. The DBAs are happy.
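The flexible-schema workflow above can be sketched in a few lines. The table layout follows the flexible table from the previous post; the new attribute id (6) and the restaurant id (42) are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A minimal flexible ("schemaless") table: one row per
# (restaurant, attribute) pair, as discussed in the previous post.
conn.execute("""
    CREATE TABLE restaurant_attributes (
        restaurant_id INTEGER NOT NULL,
        attribute_id  INTEGER NOT NULL,
        value         INTEGER NOT NULL,
        PRIMARY KEY (restaurant_id, attribute_id)
    )
""")

# Adding a brand-new attribute requires no DDL and no DBA approval:
# we just allocate the next id in the codebase (6 is hypothetical here)...
HAS_LIVE_MUSIC = 6

# ...and immediately start writing and reading it.
conn.execute("INSERT INTO restaurant_attributes VALUES (42, ?, 1)", (HAS_LIVE_MUSIC,))
row = conn.execute(
    "SELECT value FROM restaurant_attributes "
    "WHERE restaurant_id = 42 AND attribute_id = ?",
    (HAS_LIVE_MUSIC,),
).fetchone()
print(row)  # (1,)
```

The schema change happened entirely in application code: the database never saw a CREATE or ALTER statement.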
We can see why people love flexible schemas: they remove a major bottleneck from the process.
But what if we explicitly discuss the scalability of the schema-change process itself, to see how it could be addressed with other designs too? That is going to be the topic of one of the following issues.