How to delete a lot of data

In the previous issue we started discussing derived data sets.  There is an interesting and frequently encountered problem that would also allow us to illustrate some ideas about propagation delay: deleting a lot of data from a large table.

Suppose that we have an ordinary table with a number of data fields:

  • There are many rows in this table: say, around 10 million rows;

  • There is a timestamp column that records when the row was inserted.  This column is also indexed.

  • At some point we find that we no longer need a big chunk of the rows, for example many (but not all) of the old ones, so it’s safe to delete them (and we actually do want to delete them);

  • The database is relational;

  • The table lives in a replicated database, with multiple replica servers that are used by ready-only clients;

  • Our table is quite active, with new rows getting inserted constantly (and maybe even some updates).

We want to get rid of some of the older rows, say half of the current table (5 million rows).  After that we want to regularly run the same code every month to continue pruning our table.

This task sounds super simple, it is very common, but at the same time it is very easy to cause an outage if you are not very careful.  I saw many, many people and teams who hoped that they would get away with a bit of corner cutting (and I was one of those people several times too).  So let’s investigate.

Single statement

The easiest way to attempt to delete the old data is with a single SQL statement:

DELETE FROM tblFoobar
WHERE created_at < ‘2018-01-01’;

NB: we use a very simple condition here for brevity, but it could be arbitrary in general (e.g. “older than two years and price < 100”), so as not to suggest the solution via table partitioning. Thank you to the readers who insisted on simpler solution :).

If we would install a database server on the laptop, create a table with 10 million rows of fake data and then run the statement above, it would run perfectly, maybe in several minutes, depending on your laptop’s performance.

However, if we run it on a live production server with replicas and multiple clients, the following will happen:

  • Your table is going to become locked for all of that time (maybe few minutes, maybe more);

  • The clients that insert new rows to this table would have to wait until the DELETE statement finishes; some of them are going to time out, or deadlock;

  • After the data is deleted, it needs to be replicated to replica servers.  There are two main possibilities:

    • Either we use statement-based replication: in that case the same statement is going to be executed on each replica, which is going to take time, and can also pause even the read-only queries for clients connected to that replica;

    • Or we use row-based replication: basically, the master is going to issue 5 million DELETE-commands to each replica, one for each row (this may be batched, but still).  Until all those statements are being processed, the replication channel is going to be essentially clogged.  This may take even more time, during which the new data is going to be not available on the replicas.

Summarizing, we would break the clients in three different ways:

  • Clients that write to the table in question will be blocked while the primary server deletes the data;

  • Clients that expect the new data to be available on replicas with a minimal delay will not find the data and will have to retry (or break);

  • Clients that read the table in question from replicas may become blocked while replica deletes the data;

Repeated LIMIT

It seems that the problem with the previous attempt is that it deletes too much at once.  Let’s delete in relatively small chunks.  We would have to write a small program that shall repeatedly execute the following statement:

DELETE FROM tblFoobar 
WHERE created_at < ‘2018-01-01’
LIMIT 1000;

This operator will delete up to a thousand rows matching our criteria.   Most database servers allow you to get back the number of rows actually deleted.  When that number is zero then we’ve finished our job and we can stop.  For our example this operator would be executed around 5000 times. 

If we try to run this program we would most probably cause a similar outage.   The problem is that each deletion will run much faster, so if we would immediately execute the next deletion, we’d still be flooding the replication channel:

  • If we use statement-based replication then we’ll quickly send 5000 DELETE statements.  Probably some other lucky statements will squeeze in, so a little bit of other activity is going to happen too.  Anyway, the total execution time of 5000 statements will be strictly longer than one single statement, so there would still be a kind of traffic jam.

  • If we use row-based replication then we’ll again quickly send around 5 million rows to delete, and those will effectively clog the replication, again.  Even though some other statements will manage to fit between our statements, we’re not much better than before.

Pause between attempts

The obvious solution to this problem is to let our program sleep for some time after each DELETE statement.  The idea is that we give the servers time to process and replicate each statement, and also to process other activity that happens in the database.  But how long should we wait?

We don’t care much how long our script is going to run.  We can leave it overnight if needed. So, let’s sleep for one second between attempts: it’s going to run for about an hour and a half, or a bit longer.  Would that be enough?

One second is a lot of time, and our servers will probably manage to process the deletion and replication. So, this is going to work, but only while your database is healthy, and stays healthy for the entire runtime of this script (about a couple of hours in this case).    

But what if there is a problem?  For example, what if the primary server is overloaded by some activity spike, or some other code creates a lot of replication traffic in other tables.  Our script is going to continue adding work for the primary server to process and replicate, and it’s going to do that blindly, with no backpressure.

Scripts like that tend to stay with us for many years, usually running on a long cadence, such as monthly.  The database will grow, the database traffic will increase, and sooner or later this script is going to become a contributing factor to an outage.

Also, there are usually many scripts like that created over time.  Some day two or more of them will interfere and collaboratively create more fixed-cadence traffic than our system can handle, especially if there is a spike of other activity.

Can we increase the time between attempts?  Say, to 10 seconds?  We’re not in a hurry, aren’t we?  We can, but then our script needs to run for 15-20 hours, which is a lot of time, because it greatly increases the window of exposure to potential outages.

Also, this approach is a little defeatist: after all, we can hope that most of the time our database is actually healthy, so it can actually handle this deletion process much, much faster: maybe even 1 second is too much, if all goes well.

NB: We could also add some jitter to the sleep time, so that we would sleep for a random time, say, between 500 and 1500 ms.  This may help generally: adding some randomness is almost always a good idea in distributed systems.  But it still does not help with the lack of backpressure and efficient use of server resources.

Index locking

The “repeated LIMIT” approach has another problem.  As we mentioned in the beginning, the “created_at” field is indexed.   Again, here is how our SQL statement looks:

DELETE FROM tblFoobar
WHERE created_at < ‘2018-01-01’
LIMIT 1000;

Most probably, this statement will use the index on the “created_at” field.  So, the execution of this statement will go roughly like that:

  • Traverse the index looking for no more than 1000 rows that match our condition;

  • Lock the parts of the index that mention those 1000 rows, preparing to delete them;

  • Delete the rows and update the index, still locked;

  • Unlock the index.

If there are other statements that insert, update or delete rows from this table, they also need to update the index on the “created_at” field, and so they need to lock this index.  If it’s already locked by our deletion script, they would have to wait.  This wait is not going to be long, but the problem is that it may also increase the chance of deadlocks: one of the queries will have to be killed.  So, with some probability:

  • Either our deletion statement dies (this is a better case, because we can just wait a bit and try again);

  • Or some other random statement dies (this is much worse, because our script has a very low priority).

Here, unfortunately, we must do a lot of hand waving.  The topic of database locking is very complicated, and there are many contributing factors:

  • Table structure;

  • Database server implementation;

  • The number of indexes and their structure;

  • Access patterns;

  • The distribution of data in the table.

Nevertheless, you can expect this access pattern to be more problematic than usual, because it simultaneously traverses the index and modifies it.  Another contributing factor here is that in relational databases some operations are unnecessarily strict.  This is what happens here, for example:

  • we ask the server to delete exactly 1000 rows (if there is at least that much), and 

  • do it with the full ACID semantics, that is:

    • before the execution of this statement there are N rows, and 

    • after the execution there are exactly N - 1000 rows, and 

    • there is no intermediate state where only some of those rows have been deleted, in particular

    • if there is a deadlock then we must have our original N rows, without any change.

But we don’t really need any of this strictness.  We’d be happy to just ask the server to try and delete around one thousand rows: nine hundred would be fine, a bit more would also be fine, and you know what — if you delete even just ten and then something more important happens — it’s not a big deal, we’ll just try again.  But we cannot express this “low-effort approach” in SQL.

Cliffhanger

Summarizing, the “repeated LIMIT with pause” approach has the following potential problems:

  • It increases the chance of locks and especially deadlocks for the clients that write to our table;

  • It could still be problematic during the outage, if the sleep time between attempts is not long enough;

  • It does not use the server resources efficiently, if the sleep time between attempts is too long.

There are now more than 1700 words in this issue, so let’s see what we can do with all of that next week.