A comprehensive guide to storing boolean values

Jun 08, 2021

Booleans seem to be the simplest structural type (our definition of structural and pure types could be found in the “Structural and pure attributes” post.). This substack was started back in January with a seven-part treatise on either/or types (see “Table of Contents” > “Data types” > “Structural data types” > “Either/or”). Booleans are much simpler, so we can expect to cover them in no more than two or three articles.

Logical level

On a logical level boolean attributes keep a yes/no answer to a question about a noun. For example, “Is this post published?”, “Does this hotel have a gym?”, “Does this patient have allergies?”, etc.

A boolean data type has two possible values: true and false. Many programming languages support boolean values directly, but because of historical reasons it is common to have an integer-based representation, coerced to boolean and vice versa. Zero usually means “false”, non-zero (usually 1) means “true”. We must discuss this because in SQL land boolean values are supported only half-heartedly. Some dialects do have some support for first-class boolean values, but we want a broader coverage of commonly used databases, so we will use the integer representation.

What happens if a boolean attribute is not set for some reason? For example, the answer is currently unknown, or it was not yet provided. It seems that specifically for the boolean values this is an important distinction.

Some questions have some sort of a default answer, for example “Does this hotel have a gym?” If nobody bothered to find an answer to that question, we may safely assume that the hotel does not have a gym. So if somebody is looking specifically for a hotel with a gym, they will basically have to choose between the hotels that both have a gym AND provided this information.

There is, however, a very important class of such questions where we really must know the answer, and if it is unknown or was not provided then we must also know that. Such questions arise mostly for legal and medical reasons. “Does this patient have allergies?”. We may not assume that they don’t have allergies if the answer was not provided explicitly.

JSON representation

Discussing either/or values, we found it useful to look at the problem through the prism of JSON. It allows us to shed many relational assumptions. Here the hotel is going to be represented by the following JSON object:

{ id: 12345,
   name: “La Villa”,
   ...
   has_gym: false
}

Pretty trivial, and there is not much to talk about here, except for one suggestion. Often it makes sense to omit entirely the boolean attributes that contain “false” value. It’s easy to make your client handle such cases automatically: in many languages

if (hotel.has_gym) {
 …;
}

is going to do the right thing if the attribute is either false or missing.

This change is trivial, but it may substantially reduce the size of the JSON response. Сhanges are often propagated from the server to the client as JSON-encoded values, and one of our strategic goals is to work on decreasing the propagation traffic.

Physical level, pt. I

Let’s first investigate the easier case: questions with a default answer. We have a noun: “Hotel”. We need to add a “has_gym” attribute. We have two possibilities:

use the main table, such as “hotels” (or a side table, such as “hotel_facilities”);
create a per-attribute table, named “hotel_has_gym”.

This is how the main table would look like:

CREATE TABLE hotels (
    id INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY,
    -- … other fields …
    has_gym TINYINT NOT NULL DEFAULT 0
);

has_gym: the name of the attribute; we’re not certain if that’s a perfect name, but we’ll discuss naming in general some other time. For the boolean attributes specifically an interesting question is: will the name contain any sort of negation? For example, in some situations we may want to use the name is_active, and in other situations we may want to use the name is_inactive. More on that later.

TINYINT: the physical type of the column. We need only one bit to represent boolean values, so we may want to use the smallest integer type to represent it. Some databases have a TINYINT type that contains one-byte values from 0 to 255 (or from -128 to 127); other databases have a BIT type that is literally a bit of data. Of course, if your database properly handles SQL boolean types, go ahead and use it. Note that in the TINYINT case we only use values 0 and 1, so the other 254 values are left unused (this is more interesting than it sounds).

NOT NULL DEFAULT 0: because we’re currently looking at the attributes with a default value, here is what we get: we assume that if the value was not provided then we may assume that the hotel has no gym.

Physical level, pt. II

Let’s create a single-attribute table:

CREATE TABLE hotel_has_gym (
  id INTEGER NOT NULL PRIMARY KEY
);

For this particular case we can manage without a separate column for the boolean value. Basically, if the hotel has a gym then we’ll insert its ID into this table. Thus, if a hotel’s ID is not in this table then it has no gym (or we did not enter this data). Setting the attribute value to “false” is equivalent to deleting the row from this table.

This is one of the reasons why we call booleans structural types, and not pure types like numbers or strings. If it was e.g. a name of the hotel then we would have to add a separate column for the string value.

This representation is most efficient if the number of true values is much less than the number of false values (for this example we know that gyms are not a common hotel feature). Then this table is going to have a relatively small number of rows. Because the id field is marked as a primary key, looking up the “value” of this attribute is going to be very efficient.

Physical level, pt. III

Now we can discuss the more complicated case: yes/no questions which must be answered explicitly. There is a school of thought that states that NULLs are the go-to answer here: you declare the has_allergies field nullable, and then has_allergies=1 means true, has_allergies=0 means false, and has_allergies=NULL means no data.

However, I’d like to skip the discussion of this representation altogether, and make the following statement:

This section clearly could have had a better title!

Oh, sorry. Wrong window. I mean:

Yes/no questions are often enums in disguise.

In particular, the logical representation of possible answers to “Does this patient has allergies?” is really an enum type with the following variants:

HasAllergies;
HasNoAllergies;
NoAnswer.

And it should be better for everyone if we just modelled our data starting from that.

We’re going to discuss modeling enum types (and NULLs, too) in the followup articles.

Jun 16, 2021

Nitpick: your JSON optimization should be avoided like the plague. The optimization you propose is NOT equivalent. The if condition will be evaluated true and executed for ANY non-empty value on the property hash_gym. This, of course, includes "false" (as string), "f", "F", "NO", whitespace or any invalid data.

Keep in mind, you're not transforming the attribute hash_gym to an optional attribute - you are basically adding both implicit and explicit behaviour on the property (if it doesn't exist, treat is as optional and assume false, if exists, you could care less on the contents and assume true). This makes API testing a nightmare, specially if you're exposing it to people outside your team.

Expand full comment

5 replies by Alexey Makhotkin and others

5 more comments...

Minimal Modeling

Discussion about this post