Erasure
Before we can continue the discussion of the physical representation of either/or data, we need to discuss the idea of erasure. In computer science you most often hear about “type erasure”: if you write a program in a language with strong compile-time type checking, the compiled program often contains no type information at all (it is erased). We can find interesting parallels between this process and the process of data modeling.
NB: the field of type checking is extremely complicated and normally requires a lot of precision in reasoning. Here, however, our real goal is to establish the idea of erasure in database modeling; we use type erasure only to illustrate that concept.
Type erasure
We’re going to use JavaScript as an example. Thankfully, the most trivial examples are enough for our purpose, so the following text requires zero experience with JavaScript.
JavaScript is a dynamically typed programming language. There is also a more advanced superset of JavaScript called TypeScript, which is statically typed. Static type checking helps the compiler find many categories of bugs in our code. Let’s write a trivial function in TypeScript that calculates the sum of array elements:
// File: sum_arr.ts
function sum_arr(arr: number[]): number {
  let result = 0;
  arr.forEach((num) => result += num);
  return result;
}
console.log(sum_arr([2, 3, 5, 7]));
// prints 17, which is 2 + 3 + 5 + 7
The function signature on the first line is the most important part for our discussion. It declares that the function takes an array of numbers (the “: number[]” part) and returns a number. With the help of type declarations, the compiler is able to find many different bugs in our code. For example, if we accidentally apply some function that is supposed to work with text strings, not numbers, the compiler will complain and refuse to compile the program.
The funny thing about TypeScript is that if you delete all the type declarations, what is left is valid JavaScript code:
// File: sum_arr.js
function sum_arr(arr) {
  let result = 0;
  arr.forEach((num) => result += num);
  return result;
}
console.log(sum_arr([2, 3, 5, 7]));
// prints 17
It is a perfectly valid program; if you did not use TypeScript, you could write exactly this in JavaScript, and it would be correct code. Thousands of people write JavaScript every day.
So, TypeScript programs can be converted to JavaScript by removing the type declarations: that is, by type erasure. Your browser cannot execute a TypeScript program directly, so it needs this sort of preprocessing. Alternatively, you could write a program directly in JavaScript and run it in the browser as-is.
So why do we need TypeScript at all? Because when we change something in our program, we can make a mistake. The TypeScript compiler is able to catch some of those mistakes, because the type declarations are still there.
If you change JavaScript code, those mistakes can only be found by other means: automated and manual testing, thinking hard, reviewing code, or running your code in production and waiting for complaints from users. There is nothing wrong with any of those approaches, really: your other engineering practices might help you mitigate the consequences of potential errors.
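To make that risk concrete, here is a sketch of the kind of mistake that slips through once type checking is gone. We mark the argument as “any”, which is effectively what plain JavaScript gives us, and then pass strings where numbers were intended (this untyped variant and its name are our own illustration, not code from the post):

```typescript
// An untyped version of sum_arr: with "any", the compiler checks nothing,
// which is effectively the situation in plain JavaScript.
function sum_arr_untyped(arr: any): any {
  let result: any = 0;
  arr.forEach((num: any) => (result += num));
  return result;
}

// Passing strings by mistake: no compile error, no runtime error,
// just a silently wrong answer built by string concatenation.
console.log(sum_arr_untyped(["2", "3", "5", "7"]));
// prints "02357" instead of 17
```

The typed version from above would reject this call at compile time; the untyped one happily returns garbage.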
If we lost the original TypeScript source of this program, we could continue changing the JavaScript code directly, just with less help from the compiler and with more risk of misunderstanding the original code.
So, we have the following hierarchy: program.ts -> program.js.
Let’s draw the parallels with the database.
Logical schema erasure
Often we think about database schemas in terms of tables, fields, primary keys, normal forms, etc. Many of us learned this approach when we were learning database programming. But there is certainly a level above this one: we can see that because there is more than one way to represent many data structures via tables.
We saw that in the previous posts: you can put a “has_symptoms” attribute in the main table, in a side table, or in a per-attribute table. However, our higher-level pseudo-code representation (the logical schema) stays the same:
HasSymptoms (symptoms: array[]) |
HasNoSymptoms (pcr_test_result: boolean)
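For comparison, the same logical schema can be written down in TypeScript itself, as a discriminated union. The type and field names below are our own, chosen to match the pseudo-code; this is a sketch, not a schema definition from the post:

```typescript
// A hypothetical TypeScript rendering of the logical schema above.
// The union guarantees that a value is exactly one of the two cases.
type CovidRecord =
  | { kind: "HasSymptoms"; symptoms: string[] }
  | { kind: "HasNoSymptoms"; pcr_test_result: boolean };

// The compiler forces us to handle both cases, and only valid
// combinations of fields can ever be constructed.
function describe(record: CovidRecord): string {
  switch (record.kind) {
    case "HasSymptoms":
      return "symptoms: " + record.symptoms.join(", ");
    case "HasNoSymptoms":
      return "pcr test positive: " + record.pcr_test_result;
  }
}

console.log(describe({ kind: "HasSymptoms", symptoms: ["cough", "fever"] }));
// prints "symptoms: cough, fever"
```

Note that this single type corresponds to several possible physical table layouts, which is exactly the erasure relationship we are describing.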
The number of possible logical schemas is less than the number of possible physical table schemas, as we saw above: there are three physical schemas for one logical schema, and that covers only one of the concerns of physical representation.
There is a parallel with the type erasure example: the number of possible TypeScript programs is less than the number of possible JavaScript programs. First, we can write JavaScript programs with bugs that would have been caught by the TypeScript compiler; those programs are still valid (albeit buggy) JavaScript. Second, the TypeScript compiler can reorganize the code in different ways for the sake of optimization, so the same function can be represented in JavaScript in many ways, depending on our optimization needs.
Another parallel is that often you do not have access to the logical schema of your database: you only have the physical table schema, its fields, primary keys, and indexes. And you can go quite a long way working at this level alone: adding or removing a field, adding or removing a table, etc. Thousands of people execute SQL ALTER statements every day; there is absolutely nothing wrong with that!
Of course there is a reason, an idea behind every physical schema. It may not be explicitly documented, but you can gather it from various cues by looking at the structure of the table, at the code that works with those tables, and even at the data that is stored in those tables. Also, you could read old documentation or talk with people who joined your company before you and remember more history.
In the same way, if you look at the JavaScript code above, you can quickly infer from various cues that this function accepts an array of numbers: the name of the function is “sum”, so it probably does not deal with strings; the fact that the code calls the forEach function suggests that arr is an array (and the argument name probably means “array”). But if you had the original TypeScript code, you could be absolutely sure, and you wouldn’t have to guess.
Normal forms
A lot of database modeling advice on the Internet and elsewhere presents database normalization as a guiding principle. But what if normal forms are erasable, too? In the previous section we made a hypothesis that the level of physical tables may not be the only level at which to think about database modeling. Normal forms, though, are properties of physical tables: it is tables that we say are in this or that normal form.
Does it mean that normal forms do not exist at the level above the physical tables, whatever that level is?
If we look at all the tables that we proposed in the two previous posts, we’ll see that they are actually pretty well normalized! All of them are at least in third normal form; if we use per-attribute tables, they would probably even be in sixth normal form. Yet nowhere in this substack did we even mention normal forms until now.
Does it mean that there is a useful higher level of modeling representation where normal forms are not needed? Or did we actually use the idea of normalization without realizing it? Or we just got lucky, the same way that you can accidentally write a JavaScript program without bugs?
Also, if we think through our either/or example with normalization in mind, we will see that normalization does not help us maintain our invariant: we want to store either the list of symptoms in the “has symptoms” case, or the PCR test result in the “no symptoms” case; having both of them, or neither, is incorrect. Table normalization allows us to prevent many data issues, but can it maintain this particular invariant? The answer is not clear. Are there alternative ways to maintain this invariant? How are they related to normalization? Which invariants, generally, can and cannot be maintained by normalization?
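One alternative, sketched here purely as an illustration (the row shape and names are hypothetical, not from the posts), is to check the invariant in application code. Suppose a single physical table stores a flat row with both columns nullable; nothing in its normal form prevents invalid combinations, but a check is easy to express:

```typescript
// A hypothetical flat row, as a single physical table might store it:
// both columns nullable, so the table shape alone does not enforce
// the either/or invariant.
interface PatientRow {
  symptoms: string[] | null;
  pcr_test_result: boolean | null;
}

// The invariant: exactly one of the two fields must be present.
function satisfiesEitherOr(row: PatientRow): boolean {
  return (row.symptoms !== null) !== (row.pcr_test_result !== null);
}

console.log(satisfiesEitherOr({ symptoms: ["cough"], pcr_test_result: null }));
// prints true
console.log(satisfiesEitherOr({ symptoms: ["cough"], pcr_test_result: false }));
// prints false: both fields are set
```

Whether such checks belong in the application, in database constraints, or in the choice of table layout itself is exactly the kind of question this section leaves open.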
Oh, and by the way: why do we even want to find the logical schema? Our answer is that we are guided by the principle of minimality.
There are many questions in this section: we have no answers at the moment, really, just a hunch that this may be an interesting direction.
But to find some additional arguments, we shall continue our long journey towards database representation of either/or data. In the next post we’ll discuss how to actually write either/or data values, using the tables that we’ve proposed in the previous posts.