In this post we’re going to continue discussing how to reliably delete a lot of data in a relational database. In the previous post we discussed simpler approaches and pointed out why they could cause problems and even outages.
A lot of people subscribed after the previous post, and now we have more than 250 subscribers. Thank you for your attention! Please feel free to forward and otherwise share posts from this substack that you think somebody will find useful.
Recapitulation
This section is safe to skip if the previous post is still fresh in your mind.
So, a short recap of our requirements:
relational database;
table with many rows (10 million+);
we need to delete a substantial amount (a few million?) using an arbitrary condition;
a secondary index would be used for our condition;
the data is replicated from the primary server to multiple replicas;
the table is quite active with other clients inserting or updating other rows;
this whole task has the lowest priority, and should not affect other clients.
By the end of the previous post we arrived at the following solution:
Write a program that, in a loop (sketched below):
executes a SQL DELETE query that removes up to 1000 rows;
sleeps a fixed amount of time;
repeats until there are no more rows to delete.
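For concreteness, here is a minimal sketch of that loop in Python, assuming a DB-API connection and a SQL dialect that supports DELETE ... LIMIT (for example, MySQL); the table and column names come from the example later in this post:

import time

def delete_in_batches(conn, batch_size=1000, pause_seconds=1.0):
    while True:
        with conn.cursor() as cur:
            # Delete up to batch_size matching rows in one statement.
            cur.execute(
                "DELETE FROM tblFoobar WHERE created_at < %s LIMIT %s",
                ("2018-01-01", batch_size),
            )
            deleted = cur.rowcount
        conn.commit()
        if deleted == 0:
            break  # no more rows to delete
        time.sleep(pause_seconds)  # fixed pause between batches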
This solution has the following potential problems:
It may increase the chance of locks and especially deadlocks for the clients that write to our table;
It could cause additional problems during an outage, if the sleep time between attempts is not long enough;
It does not use the server resources efficiently, if the sleep time between attempts is too long.
Dealing with secondary indexes
The main problem that we have with the indexes is that we read an index and simultaneously modify it. This may cause deadlocks. So, let’s stop doing that.
We can first query the table, getting a list of row IDs that we want to delete. Then we delete those rows, using the IDs themselves as the condition. Example:
SELECT id
FROM tblFoobar
WHERE created_at < '2018-01-01'
LIMIT 10000;
This query returns up to 10000 integer values. We split this list into batches of, say, 500 ids and delete each batch:
DELETE FROM tblFoobar WHERE id IN (2, 3, 5, 7, …, 3571);
-- sleep for some time
DELETE FROM tblFoobar WHERE id IN (3581, …, 7919);
And so on.
We sleep between each DELETE for some time; the details are discussed in the next section.
After we have processed all the batches, we go back to the query step and fetch another 10000 ids. If we get no ids here it means that we’re done.
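Putting the whole read-then-delete loop together, here is a hedged Python sketch (again assuming a DB-API connection; the function name and the fixed pause are placeholders, and the pause will be replaced by replication waiting below):

import time

def purge_old_rows(conn, read_limit=10000, delete_batch=500, pause_seconds=1.0):
    while True:
        # Step 1: read a chunk of ids matching our condition.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id FROM tblFoobar WHERE created_at < %s LIMIT %s",
                ("2018-01-01", read_limit),
            )
            ids = [row[0] for row in cur.fetchall()]
        if not ids:
            return  # nothing left to delete
        # Step 2: delete those ids in smaller batches, pausing in between.
        for start in range(0, len(ids), delete_batch):
            batch = ids[start:start + delete_batch]
            placeholders = ", ".join(["%s"] * len(batch))
            with conn.cursor() as cur:
                cur.execute(
                    "DELETE FROM tblFoobar WHERE id IN (" + placeholders + ")",
                    batch,
                )
            conn.commit()
            time.sleep(pause_seconds)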
With this approach we only lock the secondary index (say, the one on the created_at field) for reading (if ever), and it gets unlocked after we get our 10000 ids. Then, as we delete the rows, we update the primary index (the one that covers the id column). Of course, the deletion will also update all other secondary indexes, but those indexes would only be locked for writing (if ever).
The numbers that we’ve chosen are somewhat arbitrary, but they should be good enough for many applications. First, we certainly want to delete the rows in batches, never row by row. The batches should not be too small (too many round trips and pauses), and should not be too large (longer-held locks and bigger replication transactions). So, why not 500 for the deletion batch?
The reading limit of 10000 means that we have 20 deletion batches, and there are going to be 20 pauses between them. The query is quite lightweight, because it returns basically an array of integers, which is going to be on the order of 64 or 128 kilobytes, which is nothing. You can certainly tweak the numbers, and you should really provide a way to override those defaults for special applications.
Another benefit of splitting read and write is that you can read from the replica, thus conserving limited resources of the primary server.
Dealing with replication backpressure
The replication channel is a shared resource, and we want to be good citizens and use it fairly. Our specific example, deleting unused data, is usually of the lowest priority, so we really want to cause minimal or no trouble.
First, we don’t want our changes to pile up: if the replica did not process our first task, there is not much sense in submitting another one. Second, we want to give other users of the replication channel an opportunity to replicate their changes.
We solve both problems by observing the replication and waiting for each change to be processed.
We could implement a kind of replication beacon: a small table that is replicated with the rest of the tables in our database.
CREATE TABLE replication_beacon (
value BIGINT NOT NULL PRIMARY KEY
);
This table is going to contain a single row, and a single column: basically, it’s a single global variable. In the beginning we initialize it with some value. Then we can monotonically and atomically increment this value, and it’s going to be replicated to replicas. This replication is the smallest possible, so normally it’s not going to cause any issues.
How do we observe the replication? Let’s execute, for example, one of our DELETE queries. It has already started replicating, but we want to know when it finishes. Let’s update the replication beacon:
UPDATE replication_beacon SET value = value + 1;
This update is now also being replicated. Now let’s read the new value from the replication_beacon table on the main server. Say it’s going to be 12345. We do not care if there were other increments, we don’t need this update and read to be atomic.
Now we can repeatedly query the replication_beacon table on the replica server, waiting until it becomes 12345 or more. When it happens, our DELETE query has been replicated too.
We assume that replication applies changes sequentially, in commit order, which is the typical behavior for relational databases. Other distributed databases that replicate each table independently would need to be handled in a different way.
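A rough sketch of this mechanism, assuming two DB-API connections, primary_conn and replica_conn, that are created elsewhere (the names and the 0.1-second polling interval are placeholders):

import time

def bump_beacon(primary_conn):
    # Increment the beacon on the primary and read back the new value.
    with primary_conn.cursor() as cur:
        cur.execute("UPDATE replication_beacon SET value = value + 1")
        cur.execute("SELECT value FROM replication_beacon")
        (target,) = cur.fetchone()
    primary_conn.commit()
    return target

# After each DELETE batch on the primary:
target = bump_beacon(primary_conn)
while True:
    with replica_conn.cursor() as cur:
        cur.execute("SELECT value FROM replication_beacon")
        (seen,) = cur.fetchone()
    if seen >= target:
        break  # our DELETE has been replicated too
    time.sleep(0.1)  # arbitrary polling interval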
The solution just described is very simplistic (though it should work as a general idea). You need to implement it in your specific database environment. In the next section we’ll discuss what needs to be handled outside of the happy path to make this solution truly reliable.
First-class replication: an implementation
So, the first question is: who is going to make sure, and when, that the replication_beacon table exists and contains exactly one row? The easiest way is to create a small script that does exactly this. We also need to make sure that everyone who has database administration duties is aware of this script, this table and their purpose, and knows how to troubleshoot.
The script needs to handle as many corner-cases as you can imagine, including the theoretical ones. Does the table exist? Could we create it? Is it readable? If not, can we add or update the necessary grants? Does it have a single row with a value? If we tried to insert that row, did we succeed? If there is more than one row, and we tried to delete the extra rows, did we succeed? Does the table have the correct structure? If not, how do we report it to humans? Did the value overflow?
Is this even the best structure for this kind of table?
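As a starting point, here is a hedged sketch of such a setup script in Python; it handles only a few of the corner cases above, and the CREATE TABLE IF NOT EXISTS syntax and the error reporting would need to be adapted to your engine and environment:

def ensure_beacon(conn):
    with conn.cursor() as cur:
        # Create the table if it is missing.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS replication_beacon ("
            " value BIGINT NOT NULL PRIMARY KEY)"
        )
        cur.execute("SELECT COUNT(*) FROM replication_beacon")
        (count,) = cur.fetchone()
        if count == 0:
            # Initialize the single row.
            cur.execute("INSERT INTO replication_beacon (value) VALUES (1)")
        elif count > 1:
            # Unexpected state: report to humans instead of guessing.
            raise RuntimeError("replication_beacon contains more than one row")
    conn.commit()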
The second question is how to give developers an easy and reliable way to use this functionality. Somebody must carefully implement a module that integrates with your database environment and your software development environment and handles all possible conditions in a centralized way.
One suggested API would consist of two functions:
increment-and-return the value on the main server;
wait-for-the-value on a specified replica (by default, it could use the “currently connected” replica).
Both functions seem to be pretty straightforward, but they should be really robust and handle all possible failure cases, including theoretical ones. Practice shows that sooner or later you will trigger all of them, no matter how unbelievable they may sound.
So, the “increment-and-return the value”: did the UPDATE query succeed? Could we read the new value after the update? What if we lost the connection to the main server between those two queries? If there were any problems with either query, did they get reported properly (the human-readable error, the database server host, the client host)? How are the timeouts handled? What if the main server failed over between two calls of this function? What else could happen here?
Now, “wait-for-the-value”: how often are we going to query our table? Unfortunately, there is no way to “subscribe” to the update, so we must use the polling approach. Maybe 100 ms would be good enough? Could we even connect to the specified replica? What happens if replication is completely broken on this replica and the value never advances: what is the global timeout? How do we handle the missing or empty table?
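For illustration, a hedged sketch of a more defensive “wait-for-the-value”, with an explicit polling interval and a global timeout (the numbers and error types are placeholders, not recommendations):

import time

def wait_for_value(replica_conn, target, poll_seconds=0.1, timeout_seconds=600.0):
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            with replica_conn.cursor() as cur:
                cur.execute("SELECT value FROM replication_beacon")
                row = cur.fetchone()
        except Exception as exc:
            # Missing table, revoked grants, lost connection, and so on.
            raise RuntimeError("cannot read replication_beacon on replica: %s" % exc) from exc
        if row is not None and row[0] >= target:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "replica did not reach beacon value %d within %.0f seconds"
                % (target, timeout_seconds)
            )
        time.sleep(poll_seconds)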
The key thing here is that all of this should not be a concern for the ordinary developer. This needs to be handled by the senior engineers who own the core data access modules. The ordinary developer needs:
An explanation of the issue we’re solving with this;
A list of cautionary tales about outages that happened when people disregarded this advice;
A merge request review policy about this issue;
A clearly defined API, with error cases explained, and designed in a way that it’s hard to ignore errors;
Clear instructions about where to ask for help in case of problems;
A clearly defined way of receiving advice on handling non-typical needs and requirements.
This is a tricky moment for this substack. We go into a lot of details here that people should be aware of. However, they should not always have to care about every single detail, because it should ideally be handled on the company-wide level.
Dealing with the distributed system
The previous text swept under the rug an important question. So, we have a primary server and many replicas. We observe the replication, using one of the replicas. But which one? Generally every replica could be lagging in a different way, and they could all have separate outages and issues during the time our batch deletion runs.
Let’s outline possible strategies. The simplest one is that we choose a random replica and work with it all the time. The potential problem with that approach is that this replica could be an outlier. For example, it can be very fast and healthy, so we’re going to do our work very quickly, overloading other, slower replicas. Or, this replica can be slow and unhealthy, so we’re going to be unnecessarily slow (which is a lesser problem than the previous one). Also, any replica can go down while we are connected to it, for many different reasons. Fortunately, the solution for this is trivial: just reconnect to a different one.
Another approach is to switch to a different replica after some time: maybe a number of batches, or a number of wall clock seconds. This should greatly reduce the probability that one outlier introduces unnecessary imbalance.
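For example, a minimal sketch of the “switch after a number of batches” idea; the list of replica connections and the batch counter are assumed to be maintained by the surrounding code:

def replica_for_batch(batch_number, replicas, batches_per_replica=20):
    # Stay on one replica for a fixed number of batches, then move to the next,
    # so a single outlier cannot skew the whole run.
    return replicas[(batch_number // batches_per_replica) % len(replicas)]

Time-based rotation would look the same, just keyed by elapsed wall-clock time instead of the batch counter.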
You may think up some other approaches as an exercise. It’s interesting that even in this relatively simple example we potentially have to deal with all the complexity of a typical distributed system. I think that if you imagine some sort of a demon-type (as in “Maxwell’s demon”) malicious chaos monkey that observes your system and strategically introduces failures, then any possible algorithm of choosing the replicas to observe replication can be beaten by this monkey. Your task will either run very inefficiently, or it’s going to fail. Thankfully, failing is not a big deal for this particular task: you can always run this code some time later, hoping that the monkey is distracted.
If I were to give any advice on how to deal with distributed systems, it would be:
Embrace failure: design components in such a way that it’s ok for them to break sometimes;
Trim accidental complexity: there is an intrinsic complexity of the real-world problem that you solve, and the accidental complexity of a solution. Try to keep the accidental complexity within budget.
Design for idempotence: design components in such a way that any operation could be repeated multiple times; after previous failure it would catch up and attempt to finish the work; after previous success it would just do nothing.
Outro
As usual, we want each post to be open-ended. In this post we’ve discussed a number of topics, and we hope to return to them again in the future. Summarizing, the pieces of this particular puzzle are:
avoiding N+1 queries by processing multiple ids;
decoupling read from write;
using backpressure to improve reliability;
a conflict between scalability and speed;
handling non-happy paths;
treating change propagation as a first-class citizen;
API design;
improving software development velocity;
dealing with the complexity of distributed systems.
Let’s discuss something else next week.