Flexible database schemas

Dec 16, 2021

There are several ways of implementing flexible database schemas, sometimes also called “schemaless”. In relational settings the most common approach is to create a table with arbitrary attributes of a certain class of objects. Another approach is a JSON-typed field where you can store JSON objects (key-value data structure) that contain more or less free-form data. But why are we doing that?

In this issue we’ll begin investigating the design space of flexible database schemas, driving forces behind them, and the underlying similarities between traditional relational design and flexible designs.

Suppose that we need to model a database of restaurants. We have the usual data attributes: name of the restaurant, address, URL, phone number. We also have a set of linked structured data, such as opening times or a list of cuisines that this restaurant serves. So far it’s pretty clear how to model this. Then, we have a long-ish list of boolean attributes (yes/no questions), such as:

Does it have delivery?
Is it pet-friendly?
Does it have a kids play area?
Is it vegan-only?
Is it halal?
Is it wheelchair-accessible?
Does it have a terrace?
Is it on the rooftop?
Is it romantic?
Is it a Michelin restaurant?

And so on, and so on. When you get serious about the problem domain, the list of questions like that is going to grow. Your product managers will constantly have new ideas on how to serve different customer segments better, and they will want to gather and use all this data. Some attributes are also going to become less relevant as time goes by, and you may want to delete them.

For simplicity, let’s focus on boolean attributes. Imagine that during our initial brainstorm we made a list of the first five attributes (the other five were introduced later). Most probably we have a table called restaurants with the following fields:

id;
name;
address (ok, this one is quite complicated, but let’s ignore that for now);
url;
phone_number.

Where do we put the five boolean attributes? For simplicity, we’re going to add them directly to the restaurants table as five small integer fields:

has_delivery;
is_pet_friendly;
has_play_area;
is_vegan_only;
is_halal.

Encoding of boolean attributes is discussed in “Comprehensive guide to storing boolean values”.

So, our hypothetical table has the following structure at this moment:

CREATE TABLE restaurants (
  id INTEGER NOT NULL AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  address VARCHAR(512) NOT NULL,
  url VARCHAR(256) NOT NULL,
  phone_number VARCHAR(64) NOT NULL,
  has_delivery TINYINT DEFAULT 0,
  is_pet_friendly TINYINT DEFAULT 0,
  has_play_area TINYINT DEFAULT 0,
  is_vegan_only TINYINT DEFAULT 0,
  is_halal TINYINT DEFAULT 0
);

We have 11 fields so far. What happens when we need to add five more (and then maybe twenty more)? We need to play “The Game of Tables”. Straightforward solution is to add five more fields to this table, but at this point we’re probably already a bit uncomfortable about that. We know that we will eventually have dozens of such attributes, maybe even hundreds, and that’s not even counting other, non-boolean data. We’re not too keen to have a table with 250 fields (or are we?).

An interim solution that you may sometimes see is to create a side table that stores this, arguably less important data. We create a table called “restaurant_flags”, add a couple of dozen more fields there, and then we’re back on square one.

An arguably more systematic approach would be to create a flexible, schema-less table for storing arbitrary boolean attributes of a restaurant:

CREATE TABLE restaurant_attributes (
   restaurant_id INTEGER NOT NULL,
   attribute_id INTEGER NOT NULL,
   value INTEGER DEFAULT 0,
   PRIMARY KEY (restaurant_id, attribute_id)
);

Here, restaurant_id is a reference to the restaurants.id field. attribute_id contains a number that corresponds to one of the possible restaurant attributes (e.g.: 6 for “ is it wheelchair accessible?”, 7 for “does it have a terrace?”, and so on). The value field has the same semantics as in the main table scenario: 1 for yes, 0 for no.

This is the final structure of this table. We’re not going to change it, ever. When we need a new attribute, we allocate another integer number, introduce a constant value in our codebase and use it in our queries. This is a very clean, non-surprising and socially acceptable way of implementing a set of not-too-important attributes that will come and go as product requirements change and people experiment with new ideas.

As you may have noticed, our goal is to spend at least 10 posts on every single topic, so let’s see which questions we may ask about the implementations above:

What will the per-attribute implementation look like?
Is the “restaurant_attributes” table really flexible?
Is it schemaless?
Why are we really not satisfied with the straightforward solution with many fields?
What do we do with the first five attribute fields after we introduce the new-style table?
Can we use JSON encoding?
How do we treat non-boolean attributes?
What are the benefits and shortcomings of the “bag of attributes” approach?
What do typical queries look like? Are they convenient to write, what are their performance characteristics?

There would be many more questions that we’ll ask during the discussion that follows. Stay subscribed.

Minimal Modeling

Discussion about this post