Derived datasets: an overview
For the last couple of weeks I’ve been thinking about derived datasets. Some things are not traditionally explained as such, yet you can approach them all in a uniform way.
NEW: The “About” page of this substack was updated with a draft Table of Contents!
When we think about data modeling, we’re mostly interested in primary data (directly corresponding to business-level things such as users, orders, comments, etc.), a.k.a. the source of truth. The main reason to generate derived datasets is the speed of querying that data: I’m not even sure there is any other reason. Or is there?
We know that unchecked data duplication can lead to a lot of confusion. However, the performance concerns are so important that we simply cannot ignore the question of secondary data when talking about data modeling. We won’t go into too many technical details; instead, we’re more interested in:
general approach to data dependencies;
cataloging and documenting data assets, including secondary datasets;
untangling implicit and explicit optimizations;
propagation delay and propagation traffic.
In a typical database various forms of secondary data usually take several times more disk space than the primary data.
Arguably, table indexes satisfy the definition of secondary data. Moreover, in most cases they are created to improve the speed of querying data.
One of my personal insights from a few years ago was that an index could be thought of as a small, specially organized table. It is maintained by the database server automatically. In typical relational databases, indexes are updated together with the data change, as part of the same atomic transaction.
We could say that indexes have a propagation delay of zero.
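A toy sketch of this view (in Python, with made-up names; a real database does this with B-trees on disk, not dictionaries): the “index” is a small derived structure updated in the same step as the primary data, so it can never be observed stale.

```python
class Table:
    def __init__(self):
        self.rows = []            # primary data: the source of truth
        self.index_by_email = {}  # derived data: email -> row position

    def insert(self, row):
        # Both structures change in the same step, so the index is never
        # stale: its propagation delay is zero.
        self.rows.append(row)
        self.index_by_email[row["email"]] = len(self.rows) - 1

    def find_by_email(self, email):
        # O(1) lookup instead of scanning every row.
        pos = self.index_by_email.get(email)
        return self.rows[pos] if pos is not None else None

t = Table()
t.insert({"email": "alice@example.com", "name": "Alice"})
print(t.find_by_email("alice@example.com")["name"])  # prints "Alice"
```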
Indexes are created to improve the performance of some important query, or a family of queries. Wider tables can accumulate many indexes over time. Indexes are not cost-free: when you change data in the table, all affected indexes need to be updated; in particular, when you insert or delete a row, every index on the table is affected! Moreover, because this needs to happen within an atomic transaction, index updates need to be carefully orchestrated. When there are many indexes, this can lead to substantial write performance degradation, and sometimes even to unwanted locking of read-only queries.
But the root of the problem is that indexes are an implicit optimization. The way database servers work, they can analyze the query and pick the best index to improve execution time. Experienced database administrators and software developers can intuitively see which columns should be indexed, because indexing them would almost certainly improve the performance of anticipated queries. However, when a certain class of queries is no longer executed against the table, for example due to refactoring, there is no easy way to find the indexes that are no longer needed and drop them. The same can happen if the query changes in a small way, or if the data distribution drifts over time and the query optimizer decides to stop using an index it used before.
There are certainly ways to analyze a running database, find unused or suboptimal indexes, and fix them. The point here is that if we established some general framework for working with derived datasets, we might be able to improve our handling of indexes too.
A very common configuration for typical relational databases is primary/replica. There is a primary server that holds the up-to-date state of the database. It accepts write requests from clients, updates the data on disk and then sends a copy of updated data to replica servers. Clients that need to read the data can use replica servers and thus do not use the resources of the primary server.
There could be just a handful of replica servers, or many more: dozens, hundreds, and sometimes even thousands of replicas organized in a replication hierarchy.
Replication is not instant, for many reasons: replicas can be overloaded and unable to process updates from the primary in time; the primary can be overloaded and unable to push updates to all the replicas in time; the network can be flaky, stalling the connection; replication can break completely, requiring manual or automatic intervention. In short, there is always a replication delay.
In a healthy system replication delay is measured in milliseconds, but it can quickly grow to seconds or even minutes during failures and outages. We must engineer our systems so that they tolerate human-visible replication delay.
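A toy model of asynchronous replication (no real networking; all names are made up): the primary records each change with a timestamp, the replica applies changes later, and the age of the change at apply time is the replication delay.

```python
import collections
import time

log = collections.deque()  # changes written on the primary, not yet replicated
replica = {}               # the replica's copy of the data

def primary_write(key, value):
    # The primary records the change together with a timestamp.
    log.append((key, value, time.monotonic()))

def replica_apply_one():
    # The replica applies the oldest pending change; the age of that
    # change is the replication delay we observe.
    key, value, written_at = log.popleft()
    replica[key] = value
    return time.monotonic() - written_at

primary_write("user:1", "Alice")
time.sleep(0.01)  # stand-in for network and replica lag
delay = replica_apply_one()
print(replica["user:1"], round(delay, 3))
```

In this model the delay is simply however long a change sits in the queue, which is why an overloaded replica (a growing queue) directly translates into growing replication delay.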
We’re going to discuss the complete approach to this method of data replication in one of the follow-up posts.
Another extremely common approach to optimizing queries is creating a derived table with an ad-hoc structure and maintaining it as the source dataset changes. Such tables can be implemented with varying levels of sophistication: from a simple cron script that runs every hour or every day, up to a formalized, explicitly defined data pipeline.
How do we detect that the data has changed? In some cases we can afford to regenerate the derived table completely, but this may quickly stop being feasible as the source dataset grows. We may want to update only the part of the derived dataset that depends on the changed values. This is possible in many, but not all, cases, depending on how exactly the derived table is produced from the source data. We’re going to discuss this topic separately.
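A minimal sketch of incremental maintenance, with a hypothetical schema: a derived table of per-user order totals, where each change updates only the one derived row it affects instead of regenerating everything. (This works here because a sum can be updated in place; not every derivation allows this.)

```python
orders = []          # source of truth: (user_id, amount)
totals_by_user = {}  # derived table: user_id -> total spent

def add_order(user_id, amount):
    orders.append((user_id, amount))
    # Incremental maintenance: only the one derived row that depends on
    # the changed values is updated; nothing is regenerated from scratch.
    totals_by_user[user_id] = totals_by_user.get(user_id, 0.0) + amount

add_order(1, 10.0)
add_order(1, 5.0)
add_order(2, 7.0)
print(totals_by_user)  # {1: 15.0, 2: 7.0}
```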
For now, let’s discuss the propagation delay in this scenario. So, we changed data in the source table. When will this change be reflected in the derived table? When do we even want it to be? In many cases we do not really need immediate updates: a very important class of such derived tables is all kinds of monthly, quarterly and yearly reports.
Propagation delay for such scenarios follows an unusual formula: before the end of the reporting period the propagation delay is effectively zero (because we won’t be using the data yet); then, as soon as the period ends (plus some settlement time), we expect the derived table to be generated and available within some acceptable time. After the generation is finished, the propagation delay stays at zero until the end of the now-current reporting period.
Of course, we could begin preparing this derived data before the reporting period ends, but then we need to make sure that there are no loose ends when this incremental process completes.
We’re going to discuss the handling of derived tables in one of the follow-up posts, but for now we could also note that there is a question of keeping track of such tables.
We may find ourselves in a situation where some script diligently updates a derived table, and it’s actually quite healthy (or not), but nobody reads that table. This is very common, and this kind of waste of CPU and disk resources can have a substantial monetary impact. It can go on for years. As usual, we’re more concerned with the following question: if we get rid of the unused derived table, how will the performance of the more useful derived tables improve? Will they be available earlier, maybe? We’re less interested in saving a few gigabytes of disk space, and more in the question of “how can we postpone buying the next batch of disk drives?”.
Derived tables could also be thought of as a kind of specialized table index, maintained from outside the database server.
Another common approach to improve database performance is caching data, using tools like Memcached or Redis. This is useful to add here because we want to think about data modeling in non-relational contexts too.
The general idea is simple: before we execute a database query, we check the cache. If it already contains the value, we use it. The cache is much faster than the database query, so we save a lot of time. If the cache does not contain the value, or the value is out of date, we actually execute the database query, store the result in the cache, and use it. This way multiple clients can cooperate, because only one of them needs to pay the price of executing the query; alternatively, we could say that this price is spread across all clients.
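This pattern is often called cache-aside; a minimal sketch, where a plain dict stands in for Memcached or Redis and `run_query` is a stand-in for the real database call:

```python
cache = {}
query_count = 0  # how many times we actually hit the "database"

def run_query(key):
    # Stand-in for a slow database query.
    global query_count
    query_count += 1
    return f"result for {key}"

def get(key):
    if key in cache:
        return cache[key]     # cache hit: skip the database entirely
    value = run_query(key)    # cache miss: one client pays the price
    cache[key] = value
    return value

get("user:1")
get("user:1")
print(query_count)  # prints 1: the second call was served from the cache
```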
Again, we have the same concerns: how do we make sure that the cache changes when the underlying table does? Often we cannot query the table because that would eat away all or most of the benefits of caching in the first place.
One common strategy is expiration time. We choose a reasonable but arbitrary time, say 15 seconds, and values stored in the cache are invalidated after that time. In this case we’d only have to execute four database queries per minute. It is possible that the actual data changes one second after the value is stored, so the cached value would be stale for the remaining 14 seconds, but maybe we’re okay with that in this particular case.
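Continuing the sketch from above, expiration can be modeled by storing a timestamp next to each value (again, all names here are illustrative):

```python
import time

TTL = 15.0  # seconds: a reasonable but arbitrary expiration time
cache = {}  # key -> (value, stored_at)

def get(key, run_query):
    entry = cache.get(key)
    now = time.monotonic()
    if entry is not None and now - entry[1] < TTL:
        return entry[0]         # still fresh: no database query
    value = run_query(key)      # missing or expired: query and re-store
    cache[key] = (value, now)
    return value

print(get("k", lambda k: "v1"))  # prints "v1": queried and stored
print(get("k", lambda k: "v2"))  # prints "v1": still within the TTL
```

The second call demonstrates the staleness trade-off: even though the “database” now returns a different value, the cached one is served until the TTL runs out.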
There are several other strategies for dealing with caching in database contexts, but we’re not going to discuss them right now. What we’re interested in is: a) we could say that there is a propagation delay for cached data, and we could calculate it, at least in principle; b) sometimes cached representations suffer from the same issue as other derived datasets: for example, part of a cached value is not used by any client, but it is still diligently calculated and stored in the cache. Also, in proactive caching scenarios we can again find that the cache is filled with unneeded data; this decreases the overall efficiency of the caching system.
You can think about cache invalidation explicitly in terms of data dependencies. This would allow treating database caching systematically. Maybe we’ll discuss this some other time.
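One possible shape of this idea, sketched with hypothetical names: each cache key records which source tables it was derived from, and a change to a table drops every dependent key.

```python
cache = {}
deps = {}  # source table name -> set of cache keys derived from it

def put(key, value, source_tables):
    cache[key] = value
    for table in source_tables:
        deps.setdefault(table, set()).add(key)

def table_changed(table):
    # Invalidate every cached value that was derived from this table.
    for key in deps.pop(table, set()):
        cache.pop(key, None)

put("report:2024", 42, {"orders", "users"})
table_changed("orders")
print("report:2024" in cache)  # prints False
```

The dependency map itself is yet another piece of derived data that has to be kept accurate, which is exactly why this deserves a systematic treatment.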
This was a brief overview of how the general principles manifest in different contexts. This issue tried to focus on propagation delay (though we also touched on other topics). We will certainly discuss both data dependencies and documenting derived data sources in much more depth in the future.
The topic of implicit optimizations is quite subtle and I’m not really sure how to approach it. I find the idea of erasure very relevant here, but I cannot put it into words just now.
Also, this time I have no idea about what’s going to be the topic of the next post!
P.P.S.: Follow me on Twitter: https://twitter.com/alexeymakhotkin.