In the previous post we discussed simple approaches and pointed out why they could cause problems and even outages. Now let's discuss more complete solution that is more robust and applicable in many scenarios.
Please don't take this the wrong way, but following on the idea that you want to keep it generic in terms of engine, "WHERE IN" with a hardcoded list is probably the possible worst approach you can use, specially with large static lists. Its actually less efficient because you are performing two index scans (one to generate the IN range) and the other one to actually perform the delete. The canonical form "DELETE FROM table WHERE ID IN (SELECT ID FROM table where created_at <=x)" has both the advantage of potentially generating a temporary table, avoid row locking and is actually a single transaction. Also, be aware that EXISTS instead of IN may yield significant performance improvements on some SGBDs, and some SQL engines *will* crap their pants with large IN static lists (such as Clickhouse's one).
I love EXISTS, it should be used more. But it seems that many ORMs feel that IN should be enough for everybody. I understand theoretically that EXISTS is more lightweight, but I find it difficult to understand how it can actually deliver "significant performance improvements". What is "SGBDs"?
I never used Clickhouse, but I looked at the documentation and it seems that handling deletion is even more unusual: https://clickhouse.tech/docs/en/sql-reference/statements/alter/delete/ Also, the entire discussion around transactional locking is obviously irrelevant for Clickhouse, so I guess that subquery would be enough.
Considering less-traditional systems is interesting, but this particular post is I guess more relevant for the normal system such as MySQL.
Please don't take this the wrong way, but following on the idea that you want to keep it generic in terms of engine, "WHERE IN" with a hardcoded list is probably the possible worst approach you can use, specially with large static lists. Its actually less efficient because you are performing two index scans (one to generate the IN range) and the other one to actually perform the delete. The canonical form "DELETE FROM table WHERE ID IN (SELECT ID FROM table where created_at <=x)" has both the advantage of potentially generating a temporary table, avoid row locking and is actually a single transaction. Also, be aware that EXISTS instead of IN may yield significant performance improvements on some SGBDs, and some SQL engines *will* crap their pants with large IN static lists (such as Clickhouse's one).
I love EXISTS, it should be used more. But it seems that many ORMs feel that IN should be enough for everybody. I understand theoretically that EXISTS is more lightweight, but I find it difficult to understand how it can actually deliver "significant performance improvements". What is "SGBDs"?
I never used Clickhouse, but I looked at the documentation and it seems that handling deletion is even more unusual: https://clickhouse.tech/docs/en/sql-reference/statements/alter/delete/ Also, the entire discussion around transactional locking is obviously irrelevant for Clickhouse, so I guess that subquery would be enough.
Considering less-traditional systems is interesting, but this particular post is I guess more relevant for the normal system such as MySQL.