Before we think about physical representation of data in the database, perhaps we could make a list of data types commonly used in database modeling. This post is going to be quite dry, but this would be basically a roadmap.
At the moment we’re concerned mostly with data types only for the attributes. We divide them into pure and structural types, and we’re more interested in structural ones. In particular, we’re going to discuss either/or data below.
There are also several related concepts that we won’t cover here, but we mention them in the last section just to pique your interest for the future posts.
Pure attributes, roughly speaking, occupy a single field in a single table. For example, such attributes as age and name of the user are pure, because there is no difference in storing this value if the user is 10 years old or 80, or if their name is Ann or Stephen.
Structural attributes may span across different tables and different fields. Together those fields would encode the attribute value. Usually there could be different possible representations, and we’re very much interested in researching the design space of this area.
Pure attributes
1/ There is a wide variety of numeric types in all databases. Integers of different sizes, signed or not, and fractional numbers, both floating point and fixed point (also called decimal) with different precision.
2/ String type is also widely used. There are also many technical differences: maximum length, character set, disk space allocation methods, etc.
3/ Blobs are binary objects that may look roughly like the ordinary files: the arbitrary sequences of bytes. Blobs are most often used to store image files.
4/ Date/time is an important data type, also with a lot of variability: precision, timezone, minimum and maximum date that could be encoded.
Can you remember any other types for pure attributes? Please send your comments, because we’re out of ideas really.
Structural attributes
5/ Boolean values are probably the simplest structural data type. Note that sometimes boolean values are represented as tiny integers, with 0 meaning false, and non-zero meaning true; but this is just an implementation artifact. At the logical level we consider boolean values to be a kind of enum value.
6/ Enum is the data type that consists of one or more fixed labels. For example, for the order processing you can have an enum with the following list of allowed values: “pending”, “processing”, “sent”.
7/ Pairs are the data type that consists of two values that are so closely related to each other that it makes little sense to treat them separately. So far we know a small number of examples:
amount with currency;
geographical coordinates (latitude + longitude);
width and height of a picture.
If any other examples come to mind please leave your comments, this is a very interesting topic.
Of course, the idea of a pair could be extended to 3 or more values, for example the dimensions of a container: (width, height, depth). This kind of abstract fixed list of values is called tuple.
8/ Either/or/or types are the topic of our current investigation. In some programming languages they are called algebraic types. Also, when you hear the term “pattern matching”, it very often means dealing with either/or/or types. There are many different syntaxes for representing this kind of data, so let’s use some nondescript pseudo-syntax.
Suppose that we want to ask our hotel guests how they would rate our breakfast, on a scale of 1 to 5. However, some guests did not have breakfast at all, for one of many reasons, and we’d like to know the reason, too. Also, it’s possible that some people just did not bother to fill in our survey at all, so we must record this fact too, so that our statistics is not skewed.
breakfast_survey_response =
RatedBreakfast (rating: <n>) |
DidNotHaveBreakfast (reason: <enum>) |
DidNotFillSurvey ()
If the user rated our breakfast with 4 stars then the value of breakfast_survey_response would be RatedBreakfast (rating: 4). If the user did not have breakfast because it was too early, the response is DidNotHaveBreakfast (reason: too_early). If the user ignored the survey, the response would be DidNotFillSurvey ().
In one of the previous posts we’ve discussed a somewhat similar representation of this data in JSON: https://minimalmodeling.substack.com/p/database-constraints-pt-ii.
How could we store all of this data in our relational database? There are many possible ways to do that, with their own pros and cons in different contexts. This is going to be the topic of the next week’s post.
Further topics in data types
Let’s briefly list some additional topics that would be discussed in this substack in the future. All of those concepts are deeply structural, but they are so important and/or complicated that they deserve their own post or maybe even a number of posts. Subscribe below to receive regular email updates on database modeling from us.
9/ Ids of different entities could be integers, strings, or even composite values such as pairs. Ids roughly correspond to the idea of a primary key, if you’re coming from the relational background.
10/ Counts are a special kind of numeric value. Counts appear in some contexts, mostly as the secondary data, so we’d like to kind of single them out. For example, we could count a number of logins made by the user, or the number of comments that they have left. Treating those counts is an interesting topic.
11/ Addresses and names. There is a rich programming folklore about handling (and mishandling) this kind of data. This topic certainly deserves a comprehensive discussion, because addresses and names are encountered everywhere in various databases.
12/ NULLs could technically be considered a kind of either/or value with a special handling by almost all databases. NULLs are a topic of endless debates since the dawn of time of database processing. We’re not sure if we could contribute anything here, but this theme simply cannot be avoided.
13/ JSON data type and its subsets like arrays are now a feature of most databases. This soft-schema technique is important but it could be a double-edged sword, so we must discuss it too.
14/ Polymorphic associations arise when we want to treat different things in the same way. For example, if we sell books, music and accessories on the same e-commerce website, those three things will have some common attributes (such as price), and specific attributes (only the book has the number of pages). Dealing with this kind of data is very common, and just as common are confused questions about its optimal modeling.
15/ Link types deal with things like “User rates Movie”. This is one of the fundamental concepts in our modeling approach, and there are some open questions about how to handle this kind of relationship.
16/ Finally, what is the type of entity together with its attributes? Like, if we have a User entity with age, name, login name, avatar and password: what type is it? Will this type change if we add more attributes? Does it even have the single type? Does it even have a type at all?
Next week: physical representation of either/or data.
15: hypermedia-orientation (hypermedia-oriented design; read "modeling hypermedia-based communication") with a domain-driven design infrastructure (see microservices.io and a node js lib called CASL)