In the first instalment we started talking about data integrity and structural correctness.
We’re going to continue analyzing the “health declaration example” from the previous post. That’s how it was stated:
Suppose that we want to register the health declarations of incoming passengers. They must fill in the form that basically asks:
Do you have any symptoms of COVID-19 from the list below?
If no, tick here: [ ]
If yes, mark all that apply: [ ] cold symptoms; [ ] coughing; [ ] elevated temperature.
So, for each passenger we need to store that they said “yes” or “no”, and if yes, we need to store the list of symptoms. If they avoided submitting the form we must be able to detect it. Also, we must make sure that we prevent the incorrect case: both “no” and a list of symptoms, which would be ambiguous.
JSON representation of either/or
Usually we’re interested in normal relational databases, or something that more or less resembles it. But let’s try to look at another popular representation: free-form JSON.
I don’t really know where we’re going with that, but I have some hope that it could shed some light on the general problem.
How do we encode the health declaration of a passenger in JSON format? The first thing that comes to mind is:
{“has_symptoms”: false} for “declared no symptoms” case;
{“has_symptoms”: true, “symptoms”: [“cold”, “coughing”]} for “declared symptoms” case;
One problem with this representation is that it does not prevent us from incorrect data such as {“has_symptoms”: false, “symptoms”: [“cold”, “coughing”]}.
It’s easy to mishandle this piece of data. If your code checks for ‘if (data.has_symptoms)’, it will not notice the other key that contains ambiguous value. Moreover, the structure of JSON does not encourage you to write code to better check both possible cases.
Let’s try another approach with the following encodings:
{“has_no_symptoms”: {} } for “declared no symptoms” case;
{“has_symptoms”: {“symptoms”: [“cold”, “coughing”]}} for “declared symptoms” case;
This is interesting. 1/ there are no boolean values here, so there is no way to erroneously encode at least one incorrect representation. Having boolean values in general is ok, what is problematic is having them as the structural element, distinguishing between the cases.
2/ The values that we have for both our top-level keys have the same structure: it’s just the object/hash. In case of “no symptoms” we don’t need additional data, so it’s just the empty hash value: {}. Otherwise, we have the hash value with the “symptoms” key that contains the list of symptoms.
3/ In most languages you can easily write the if-conditions as expected: e.g. ‘if (data.has_symptoms)’. But it’s now harder to write the code that accesses the list of symptoms without checking if we’re in the right branch: ‘data.has_symptoms.symptoms’ will fail if there is no ‘has_symptoms’ key.
4/ Of course, one can imagine the code that still manages to return an incorrect response such as {“has_symptoms”: {“symptoms”: [“cold”, “coughing”]}, “has_no_symptoms”: {}}. Arguably, this kind of erroneous producer code is harder to write (or, maybe we should say that it’s easier to write the correct producer code).
5/ This kind of JSON structure is pretty easy to validate with the help of JSON schema. You create schemas for both the “no symptoms” case and the “has symptoms” case and then declare that the valid result is either one or another.
You don’t even need a schema here, you can just require that this piece of data has exactly one top-level key. This is quick and dirty, but may be an efficient solution. You don’t need to consider boolean values in data to decide if the schema is correct.
6/ Having said of all that, we can’t help but notice that even this solution does not prevent us from this, presumably incorrect, encoding: {“has_symptoms”: {“symptoms”: []}}. Here we have the empty list of symptoms. We had the same issue in our first approach too. How can we deal with that? There is no clear answer at the moment, but we’ll continue this analysis later.
What’s next?
Of course, we’re interested not only in this particular example, but in other data of similar structure: the choice of either/or. There are many real-world situations that require this, for example:
If you calculate numerical expressions, you can get either numeric result (“2 + 2 = 4”) or an error (“2 / 0: division by zero");
We can ask our hotel’s guests: “On a scale from 1 to 5, how did you like breakfast?” Some people did not have breakfast for some reason, so we need to encode this possibility too. And then we can go on and ask: “If you did not have breakfast, why?”, and we need to store the responses too (“left too early”, “didn’t feel like it”, “special dietary requirements”, etc.)
Also, we still have the original question: how to encodе this kind of data in a relational database.
Also, we need to talk about the extensibility of this kind of data schemas, and that would lead us to the idea of concatenability in database schemas. Let’s discuss it in the next post.
Hm, the example doesn’t feel like a good representative of either/or situation. Seems like we can use just a flat list of symptoms, with an empty list being equivalent to “has_no_symptoms”.