In one of the previous articles we discussed one possible approach to encode either/or data in JSON. Our ultimate goal stays the same: storing structured data in a relational setting. We just need to clarify a few things using a more abstract approach.
So, let’s continue with our example:
Suppose that we want to register the health declarations of incoming passengers. They must fill in the form that basically asks:
Do you have any symptoms of COVID-19 from the list below?
If no, tick here: [ ]
If yes, mark all that apply: [ ] cold symptoms; [ ] coughing; [ ] elevated temperature.
Extending either/or data
Last time we considered the following JSON encoding of this data:
{“has_no_symptoms”: {} } for “declared no symptoms” case;
{“has_symptoms”: {“symptoms”: [“cold”, “coughing”]}} for “declared symptoms” case;
In both cases we have a single top-level key that has an object/hash as a value. Second-level object/hash may also contain keys that would store additional data. We have two cases here, with one branch for each.
How could this structure be extended if needed? For example, requirements could change. Let’s use pseudo-JSON notation:
{“has_no_symptoms”: { * } };
{“has_symptoms”: {“symptoms”: [“cold”, “coughing”], * }};
We can extend the branches by adding new keys where the asterisks (*) are. For example, we may decide that we want to store a result of the PCR test in the “has symptoms” case:
{“has_symptoms”: {“symptoms”: [“cold”, “coughing”],
“pcr_test_result”: “negative”}};
(If we would always require storing PCR test results then it shouldn’t be stored in this branched structure, otherwise it would go against the principle of minimality. Only the data that belongs only to a specific branch should be under that branch.)
It is also possible to add more top-level branches to this structure. For example, some passengers may be exempt from submitting a health declaration, say because of their diplomatic status. We can handle this case by adding the third branch:
{“exempt”: {“reason”: “diplomatic_status” } }
Thus making it an either/or/or/... data.
We can also notice that both of the additions that we’ve just discussed are completely incremental. After we implement the code that sets the new branch or sets the new key in one of the older branches, the existing data does not need to be changed in any way, it’s going to be compatible with the new code. This is an extremely useful property to have, it would help us a lot during the data migrations.
When we’ll be discussing the relational encoding of either/or data, we should strive to achieve and maintain this property!
Reducing the structure of either/or data
The second phase of concatenability (an idea that was introduced in the previous post) is removing. As we know, it’s always easier to add something to a computer system than to remove it.
Suppose that now we decide that we don’t need the precise information about the symptoms, it’s enough for now to have a yes or no answer. The up-to-date structure would look like this:
{“has_no_symptoms”: {} } for “declared no symptoms” case;
{“has_symptoms”: {“pcr_test_result”: “negative”}} for “declared symptoms” case;
{“exempt”: {“reason”: “diplomatic_status” } } for the “exempt” case;
(here we removed the “symptoms” key in the second branch).
We change the application code that generates this structure, test it and prepare to deploy. The new rows in our database would no longer have the “symptoms” key, so we also need to make sure that all the code would be ready to handle this situation. If it is under our control then it would be easier, and would depend only on the maturity of our testing process.
Stakeholder analysis
The question is, though: what shall we do with the data that is still stored in the existing database records? If we’re lazy we could just leave it as it is. The new code will ignore it anyway.
Of course, this decision would increase our technical debt. Let’s try to look at the problem from the points of view of different stakeholders of our database. What happens if we leave the unused data?
From the point of view of a software development manager, it is possible that people who join the team in the future would be slightly confused if they would find older records that contain this data. They would have to find out if this data is still relevant, if it could be used, if it should be used. Does it need to be preserved? Finding out answers to those questions a few years down the road would be a bit of hindrance to the velocity of the development process.
From the point of view of database administrators, this data is part of a standard database backup. Thus, it takes a bit of space in the backup storage, and it takes a bit of time during the backup procedure.
From the point of view of storage budgeting, this data takes some storage space. The important point to understand about space-saving efforts is that we’re interested not in “reducing the space used”, but “increasing the amount of data stored in the space we have”. If we have 100 Gb of data and we saved 5Gb, nobody cares that it’s now only 95Gb. What is important is that we can now store 5% more data without buying a new hard drive (or extending partition, or paying for more cloud block storage, etc).
Some of the data needs to be legally compliant. This particular example could be said to fall under the “personally identifiable information” (PII). It is possible that we would really prefer to get rid of this kind of data the moment it is no longer needed, so that we would have less potential problems with it in the future.
From the point of view of software developers who are busy with changing the code and making sure that it works, removing the data completely would have one additional advantage. When we remove the data completely, the remaining code that is accidentally still using this data would break sooner. We will be able to deal with the potential breakage while this change is still fresh in our minds. Maybe we would even discover that we should roll back the change for now and fix something else before we can try migration again.
Data scientists may also discover that the noisy data makes their models less efficient and may really want to have it cleaned to improve the performance of machine learning models etc. Of course, they would also be particularly affected by the same questions of knowledge management that we’ve discussed in the first bullet point: is this data still relevant?, etc.
We’re going to use this kind of stakeholder analysis when discussing all the modeling decisions that we would need to consider.
In the next installment we’ll talk about data types in database modeling. This would be the last step before we would directly tackle the relational modeling of either/or data.