Got too many tombstones? This blog post will talk about how to deal with tombstones once you already have them. For more information about tombstones, check out this post: Examining the Lifecycle of Tombstones in Apache Cassandra.
Verifying the presence of tombstones
1. Check logs:
We need to look into system.log/debug.log for tombstone-related warnings/errors:
- tombstone_warn_threshold (default: 1000): If the number of tombstones scanned by a query exceeds this number, Cassandra will log a warning (which will likely propagate to your monitoring system and send you an alert).
- tombstone_failure_threshold (default: 100000): If the number of tombstones scanned by a query exceeds this number, Cassandra will abort the query. This is a mechanism to prevent one or more nodes from running out of memory and crashing.
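As a rough illustration of the log check, the sketch below filters log lines for tombstone-related warnings and errors. It assumes the default log layout, where each line starts with its level (WARN, ERROR, and so on); the function name is made up for this example.

```python
import re
from typing import Iterable, List

# Assumption: log lines start with their level, as in Cassandra's default logback pattern.
TOMBSTONE_LINE = re.compile(r"^(WARN|ERROR)\b.*tombstone", re.IGNORECASE)


def find_tombstone_messages(lines: Iterable[str]) -> List[str]:
    """Return the WARN/ERROR lines that mention tombstones (hypothetical helper)."""
    return [line.rstrip("\n") for line in lines if TOMBSTONE_LINE.search(line)]
```

You could feed it an open file handle for system.log or debug.log and print whatever comes back.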
2. Look into raw SSTables (Sorted Strings Tables):
We can use sstable2json (Cassandra < 3.0) or sstabledump (Cassandra >= 3.0) to dump a JSON representation of an SSTable to the console or to a file.
This is especially useful for distinguishing between null values for columns and columns which haven’t been set.
Use sstablemetadata to get the estimated droppable tombstone ratio of an SSTable, or an estimated distribution of tombstone drop times (i.e. delete timestamps) on any SSTable.
Note: Inserting null values in Cassandra creates tombstones.
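To make the SSTable inspection concrete, here is a minimal sketch that counts deletion markers in a dump saved from sstabledump. It assumes the JSON marks deleted partitions, rows, and cells with a "deletion_info" object, which is how the 3.x output looks to me; both function names are hypothetical.

```python
import json
from typing import Any


def count_deletion_markers(node: Any) -> int:
    """Recursively count JSON objects that carry a "deletion_info" key."""
    if isinstance(node, dict):
        own = 1 if "deletion_info" in node else 0
        return own + sum(count_deletion_markers(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_deletion_markers(item) for item in node)
    return 0


def count_tombstones_in_dump(dump_text: str) -> int:
    """Parse the text of an sstabledump JSON dump and count its deletion markers."""
    return count_deletion_markers(json.loads(dump_text))
```

A typical use would be to redirect the sstabledump output for one Data.db file into a file, read it back, and pass its contents to count_tombstones_in_dump.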
3. Tracing:
You can enable tracing with the TRACING ON command at the cqlsh prompt. After that, each query run at the prompt is traced and the trace output is displayed. The number of tombstones scanned appears in that output.
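If you copy trace output into a file, a small parser can pull out the tombstone counts. This sketch assumes the trace activity contains summaries such as "Read 1 live rows and 3 tombstone cells"; the wording varies between releases, so treat the pattern as a starting point rather than the definitive format.

```python
import re
from typing import List, Tuple

# Assumption: read summaries look like "Read 1 live rows and 3 tombstone cells";
# the optional groups tolerate the slightly different wording of older releases.
READ_SUMMARY = re.compile(r"Read (\d+) live(?: rows)? and (\d+) tombstoned? cells")


def tombstone_counts(trace_text: str) -> List[Tuple[int, int]]:
    """Return a (live, tombstones) pair for each read summary in the trace text."""
    return [(int(live), int(dead)) for live, dead in READ_SUMMARY.findall(trace_text)]
```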
4. Nodetool tablestats:
This shows the average and maximum number of tombstones per slice (read) encountered in each table over the last five minutes.
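The tombstone figures can be extracted from saved nodetool tablestats output with a few lines of code. The sketch assumes the output is for a single table and contains the "Average tombstones per slice (last five minutes)" and "Maximum tombstones per slice (last five minutes)" lines; the function name is hypothetical.

```python
import re
from typing import Dict

TOMBSTONE_STAT = re.compile(
    r"^\s*(Average|Maximum) tombstones per slice \(last five minutes\):\s*(\S+)",
    re.MULTILINE,
)


def tombstones_per_slice(tablestats_output: str) -> Dict[str, float]:
    """Map "average" and "maximum" to the values reported for one table."""
    return {
        name.lower(): float(value)
        for name, value in TOMBSTONE_STAT.findall(tablestats_output)
    }
```

If the output covers several tables, later tables overwrite earlier ones, so split the text per table first.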
Now it’s time to get rid of those tombstones
First, avoid the insertion of tombstones.
In reality, there are many other things causing tombstones apart from issuing DELETE statements. Inserting null values, inserting collections and expiring data using TTL are common sources of tombstones.
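One way to avoid null-value tombstones on the application side is to leave unset columns out of the INSERT entirely. The helper below is a hypothetical sketch: it builds a parameterized statement with %s placeholders (the style I am assuming your driver accepts for simple statements) and skips every column whose value is None.

```python
from typing import Any, Dict, List, Tuple


def build_insert(table: str, row: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Build a parameterized INSERT that omits columns whose value is None.

    Table and column names are not escaped here, so they must come from trusted code.
    """
    present = {column: value for column, value in row.items() if value is not None}
    if not present:
        raise ValueError("row has no non-null columns to insert")
    columns = ", ".join(present)
    placeholders = ", ".join(["%s"] * len(present))
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", list(present.values())
```

Keep in mind that omitting a column leaves any previously stored value in place, so this only fits cases where you are not trying to clear that column.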
Under most circumstances, the best approach is to wait for tombstones to be compacted away normally. If they are taking a toll on your database performance or causing other issues, we can take some steps to evict them more efficiently.
Check for repairs
If full or incremental repairs have been run on the cluster in the past, but are no longer running, there may be a mix of repaired and unrepaired SSTables that will never be compacted together. Set all SSTables as unrepaired with the sstablerepairedset utility. For more details, read this post: Incremental Repair: Problems and a Solution.
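As a sketch of the sstablerepairedset step, the helper below collects the Data.db files in one table directory and builds the command line to mark them unrepaired. The directory layout and the function name are assumptions, and as far as I know the utility is meant to be run while the node is stopped.

```python
from pathlib import Path
from typing import List


def mark_unrepaired_command(table_dir: str) -> List[str]:
    """Build an sstablerepairedset command for every Data.db file in table_dir."""
    data_files = sorted(str(path) for path in Path(table_dir).glob("*-Data.db"))
    if not data_files:
        return []
    return ["sstablerepairedset", "--really-set", "--is-unrepaired", *data_files]
```

The returned list can be reviewed first and then handed to subprocess.run.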
User-defined compactions (UDC)
User-defined compactions allow us to manually select which files should be compacted. This enables us to reclaim space and limit the size of compaction so it can fit into the remaining space. These compactions are relevant only for SizeTieredCompactionStrategy (STCS) and are most useful in specific situations.
For more information, you can check this post: How to Perform (UDC) User-Defined Compactions in Cassandra.
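To illustrate how files might be picked for a user-defined compaction, this sketch reads the "Estimated droppable tombstones" figure from saved sstablemetadata output and builds a nodetool compact --user-defined command for the SSTables above a threshold. The 0.2 threshold and the function names are hypothetical, and the --user-defined flag is only present in newer nodetool versions, so check yours before relying on it.

```python
import re
from typing import Dict, List, Optional

DROPPABLE = re.compile(
    r"Estimated droppable tombstones:\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def droppable_ratio(metadata_output: str) -> Optional[float]:
    """Extract the droppable tombstone ratio from sstablemetadata output, if present."""
    match = DROPPABLE.search(metadata_output)
    return float(match.group(1)) if match else None


def udc_command(metadata_by_file: Dict[str, str], min_ratio: float = 0.2) -> List[str]:
    """Build a user-defined compaction command for SSTables at or above min_ratio."""
    chosen = sorted(
        path
        for path, output in metadata_by_file.items()
        if (droppable_ratio(output) or 0.0) >= min_ratio
    )
    return ["nodetool", "compact", "--user-defined", *chosen] if chosen else []
```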
Nodetool garbagecollect
The nodetool garbagecollect command is available from Cassandra 3.10 onwards. It runs a series of smaller compactions that also check overlapping SSTables. It is CPU-intensive and time-consuming, and it removes only expired tombstones.
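A small wrapper can keep the garbagecollect invocation consistent across tables. This is only a sketch: the function name is made up, and it assumes the -g option accepts ROW or CELL as the granularity.

```python
from typing import List


def garbagecollect_command(keyspace: str, table: str, granularity: str = "ROW") -> List[str]:
    """Build a nodetool garbagecollect command for one table."""
    if granularity not in ("ROW", "CELL"):
        raise ValueError("granularity must be ROW or CELL")
    return ["nodetool", "garbagecollect", "-g", granularity, "--", keyspace, table]
```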
Modify gc_grace_seconds
Tombstones are only removed once gc_grace_seconds have elapsed since they were created, so lowering this table option lets them be evicted sooner. We need to be careful when modifying it, because prematurely removing tombstones can result in the resurrection of deleted data. The intended purpose of gc_grace_seconds is to provide time for repairs to restore consistency to the cluster.
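The timing rule can be written down as a short check. This sketch covers only the gc_grace_seconds condition (compaction has other requirements too, such as overlapping SSTables) and uses the default of 864000 seconds, i.e. 10 days; the function name is hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_GC_GRACE_SECONDS = 864000  # 10 days, the default table setting


def past_gc_grace(
    deleted_at: datetime,
    now: datetime,
    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS,
) -> bool:
    """True once more than gc_grace_seconds have elapsed since the tombstone was written."""
    return now - deleted_at > timedelta(seconds=gc_grace_seconds)
```

The setting itself is changed per table, for example with ALTER TABLE <keyspace>.<table> WITH gc_grace_seconds = <seconds>; and any new value should stay longer than the time it takes to complete a repair cycle.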
Major compaction
Running a major compaction with nodetool compact can lead to the creation of one huge SSTable that will never have peers to compact with. To avoid ending up with a single large SSTable, use the --split-output (-s) option on Cassandra versions later than 2.1.
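For completeness, here is a sketch of how the compaction command could be assembled with the split-output option turned on by default. The function name is hypothetical, and the flag is assumed to be available in your nodetool version.

```python
from typing import List


def major_compaction_command(keyspace: str, table: str, split_output: bool = True) -> List[str]:
    """Build a nodetool compact command, optionally splitting the output SSTables."""
    command = ["nodetool", "compact"]
    if split_output:
        command.append("--split-output")
    return command + ["--", keyspace, table]
```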
I hope this was helpful. If you have any questions or thoughts, please leave them in the comments.