Cassandra deletes best practices
Looking to reclaim space on a large table. The table has old data that is no longer required and can be deleted. The deletes are based on the partition key; there are about 500k partition keys to delete.
- Would it be better to run the deletes in batches, say 50k or 100k in one go? What would be a sensible batch size (batch here meaning how many deletes are run in one go)?
- If the deletes are run from cqlsh, will cqlsh act as the client and connect to different nodes as the coordinator for each delete, or will the node where cqlsh was started act as the coordinator, with all the deletes fired from there?
- What are the best practices for running massive deletes/cleanups? Any specific dos and don'ts?
The first thing to remember about Cassandra is that deletes initially increase disk consumption rather than decrease it: each delete writes a tombstone, and the old data is only removed once compaction runs. The Last Pickle has a great blog post on that topic.
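For illustration, here is a minimal CQL sketch of that life cycle (ks.table, the key value, and the grace period below are placeholders):

    -- a delete writes a tombstone; no disk space is freed at this point
    DELETE FROM ks.table WHERE pk = 42;

    -- tombstones become purgeable only after gc_grace_seconds
    -- (default 864000 seconds = 10 days); lowering it speeds up cleanup,
    -- but only do so if repairs reliably complete within the new window
    ALTER TABLE ks.table WITH gc_grace_seconds = 86400;

Space is actually reclaimed only when compaction rewrites the affected SSTables after that grace period; nodetool compact ks table can force this, though forcing major compactions has its own trade-offs on size-tiered tables.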
Regarding your questions:
- Batches that span different partition keys put heavy pressure on the coordinator node, so they aren't recommended, especially at that size. Prefer to delete partitions one by one (see the examples at the end of this answer).
- cqlsh always sends commands to the same host (this is enforced by WhiteListPolicy), so that node acts as the coordinator for every delete and forwards the traffic to the nodes owning the data.
- I would recommend using an external tool instead: either Spark with the Spark Cassandra Connector, or DSBulk, which can also perform deletes via a custom query. Something like the following works, assuming you have a CSV file (keys.csv here, a placeholder name) with all the values of the partition column(s) you want to delete, where :pk is the name of the column in the CSV header and pk is the name of the partition column in your schema:
dsbulk load -url keys.csv -header true -query "DELETE FROM ks.table WHERE pk = :pk"
In this case DSBulk uses token-aware routing and sends each delete directly to a node that owns the data, avoiding pressure on a single coordinator node.
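For example, with a hypothetical keys.csv like this, the command above fires one single-partition delete per row:

    pk
    101
    102
    103

And to make the first point above concrete, this is the pattern to avoid versus the one to prefer (key values are placeholders):

    -- avoid: an unlogged batch spanning many partitions funnels all the
    -- work through a single coordinator
    BEGIN UNLOGGED BATCH
      DELETE FROM ks.table WHERE pk = 101;
      DELETE FROM ks.table WHERE pk = 102;
      DELETE FROM ks.table WHERE pk = 103;
    APPLY BATCH;

    -- prefer: independent single-partition deletes, which a token-aware
    -- client routes straight to the replicas owning each partition
    DELETE FROM ks.table WHERE pk = 101;
    DELETE FROM ks.table WHERE pk = 102;
    DELETE FROM ks.table WHERE pk = 103;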