Cassandra is a distributed, decentralized, fault-tolerant system. Data is replicated across multiple nodes, which can be spread over several data centers. Because Cassandra is decentralized, it can survive single or even multi-node failures without losing data. There is no single point of failure, which makes Cassandra a highly available database.
As long as at least one node still holds a replica of the data, Cassandra can recover it without resorting to an external backup. If set up correctly, Cassandra can handle disk or other hardware failures, and even the loss of an entire data center.
However, Cassandra backups are still necessary to recover from the following scenarios:
- Errors made in data updates by client applications
- Accidental deletions
- Catastrophic failures that require the entire cluster to be rebuilt
- Data corruption
- A desire to roll back the cluster to a previous known good state
Setting up a backup strategy
When setting up your backup strategy, consider the following points:
- Secondary storage footprint: Backup footprints can be much larger than the live database setup depending on the frequency of backups and retention period. It is therefore vital to create an efficient storage solution that decreases storage CAPEX (capital expenditure) as much as possible.
- Recovery point objective (RPO): The maximum targeted period in which data might be lost from service due to a significant incident.
- Recovery time objective (RTO): The targeted duration, usually fixed in a Service Level Agreement, within which a backup must be restored after a disaster or disruption to avoid the unacceptable consequences of a break in business continuity.
- Backup performance: Backup throughput should at least match the data change rate in the Cassandra cluster.
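The footprint and performance points above can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: the BackupPlan fields and both function names are hypothetical, and the estimate assumes no compression or deduplication on the secondary storage.

```python
import math
from dataclasses import dataclass


@dataclass
class BackupPlan:
    # All fields are illustrative assumptions, not Cassandra settings.
    full_size_gb: float        # size of one full snapshot
    daily_change_gb: float     # new data written per day (incremental size)
    full_interval_days: int    # days between full snapshots
    retention_days: int        # how long backups are kept


def footprint_gb(plan: BackupPlan) -> float:
    """Rough secondary storage need, ignoring compression and deduplication."""
    fulls_kept = math.ceil(plan.retention_days / plan.full_interval_days)
    return fulls_kept * plan.full_size_gb + plan.retention_days * plan.daily_change_gb


def keeps_up(backup_throughput_mb_s: float, change_rate_mb_s: float) -> bool:
    """True when the backup pipeline is at least as fast as the data change rate."""
    return backup_throughput_mb_s >= change_rate_mb_s
```

For example, a 500 GB snapshot taken weekly, 20 GB of daily change, and 28 days of retention give 4 x 500 + 28 x 20 = 2560 GB, roughly five times the size of the live data set.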
Backup alternatives
Snapshot-based backups
The purpose of a snapshot is to make a copy of all or part of the keyspaces and tables on a node and to store it in a separate snapshot directory. When you take a snapshot, Cassandra first performs a flush to push any data residing in the memtables to disk as SSTables, and then makes a hard link to each SSTable file.
Each snapshot contains a manifest.json file that lists the SSTable files included in the snapshot to make sure that the entire contents of the snapshot are present.
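A completeness check based on the manifest could look like the sketch below. It assumes that manifest.json holds a top-level "files" array of file names relative to the snapshot directory; the exact layout can differ between Cassandra versions, so treat that key and the function name as assumptions to verify against a real snapshot.

```python
import json
from pathlib import Path


def missing_snapshot_files(snapshot_dir):
    """Return the files listed in manifest.json that are absent from the snapshot."""
    snapshot_dir = Path(snapshot_dir)
    manifest = json.loads((snapshot_dir / "manifest.json").read_text())
    # Assumption: the manifest lists SSTable files under a top-level "files" key.
    listed = manifest.get("files", [])
    return [name for name in listed if not (snapshot_dir / name).exists()]
```

An empty list means every file named in the manifest is present, which is a useful check to run after copying a snapshot to secondary storage.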
The nodetool snapshot command operates at the node level, so to back up a whole cluster you need to run it on every node at the same time.
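One way to trigger the snapshot on several nodes at roughly the same time is to launch nodetool in parallel from a single machine. This is only a sketch: it assumes nodetool is on the PATH and that remote JMX access to each host is allowed, and the function names, host names, and tag are hypothetical. Many deployments run the command locally on each node instead, for example over SSH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def snapshot_command(host, tag, keyspace=None):
    """Build the nodetool invocation for one node."""
    cmd = ["nodetool", "-h", host, "snapshot", "-t", tag]
    if keyspace:
        cmd.append(keyspace)  # without a keyspace, all keyspaces are snapshotted
    return cmd


def snapshot_cluster(hosts, tag, keyspace=None, runner=subprocess.run):
    """Run the snapshot on every host in parallel; return exit codes by host."""
    def run_one(host):
        done = runner(snapshot_command(host, tag, keyspace),
                      capture_output=True, text=True)
        return done.returncode

    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return dict(zip(hosts, pool.map(run_one, hosts)))
```

Using the same tag on every node makes it easier to find the matching snapshot directories later. Any non-zero exit code in the result should be treated as a failed snapshot for that node.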
Incremental backups
When incremental backups are enabled, Cassandra creates backups as part of the process of flushing SSTables to disk. The backup consists of a hard link to each data file, stored in a backup directory. In Cassandra, incremental backups contain only new SSTable files, which makes them dependent on the last snapshot created. Files created by compaction are not hard linked.
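The reason hard links work as a backup mechanism is that the linked name keeps the data alive even after the original file name is removed, for instance when compaction replaces an SSTable. The sketch below illustrates this at the file system level; it does not call Cassandra, and the helper name and paths are hypothetical.

```python
import os
from pathlib import Path


def hard_link_into(src, backup_dir):
    """Hard link src into backup_dir; no data is copied."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / Path(src).name
    # Both names now point at the same inode, so this is fast and takes
    # no extra space. Hard links only work within one file system.
    os.link(src, dest)
    return dest
```

Because the link shares disk blocks with the original, the backup directory only starts to consume additional space once the live file is deleted. This is also why these files have to be copied elsewhere to protect against disk failure.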
Incremental backups in combination with snapshot
By combining both methods, you can achieve a better granularity of the backups. Data is backed up periodically via the snapshot, and incremental backup files are used to obtain granularity between scheduled snapshots.
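When restoring from this combination, the usual idea is to pick the latest snapshot taken at or before the target time and then add the incremental files created after it. The sketch below shows that selection only. BackupItem and restore_set are hypothetical names, and the timestamps are assumed to come from whatever catalog the backup tooling maintains.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BackupItem:
    name: str
    created: datetime


def restore_set(snapshots, incrementals, target):
    """Pick the base snapshot and the incremental files needed to reach target."""
    candidates = [s for s in snapshots if s.created <= target]
    if not candidates:
        raise ValueError("no snapshot at or before the target time")
    base = max(candidates, key=lambda s: s.created)
    # Only incrementals written after the base snapshot and up to the target
    # add anything; older ones are already contained in the snapshot.
    extra = sorted(
        (i for i in incrementals if base.created < i.created <= target),
        key=lambda i: i.created,
    )
    return base, extra
```

The granularity of the result is still limited to whole SSTable files, so the restored state corresponds to the last flush before the target time rather than to the exact moment.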
Commit log backup in combination with snapshot
This approach is similar to the incremental backup with snapshots. Rather than relying on incremental backups to back up newly added SSTables, commit logs are archived. As with the previous solution, snapshots provide the bulk of the backup data, while the commit log archive is used for point-in-time backup.
Commit log backup in combination with snapshot and incremental
Here, commit logs are archived in addition to incremental backups. This process relies on a feature called Commitlog Archiving. As with the previous solution, snapshots provide the bulk of the backup data, incremental backups complement them, and the commit log archive is used for point-in-time backup.
Due to the nature of commit logs, they cannot be restored to a node other than the one they were backed up from. This limitation restricts the usefulness of commit log restores after a catastrophic hardware failure. (Even then, only the node's data is restored, not the node itself.)
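Commit log archiving is typically configured through a commitlog_archiving.properties file on each node. The sketch below renders such a file from Python. The property names, the %path/%name and %from/%to placeholders, and the GMT timestamp format reflect the file as commonly shipped with Cassandra, but they should be checked against the installed version. The directories and the function name are hypothetical.

```python
from datetime import datetime, timezone


def archiving_properties(archive_dir, restore_dir, restore_to):
    """Render commitlog_archiving.properties content for a point-in-time restore."""
    if restore_to.tzinfo is None:
        raise ValueError("restore_to must be timezone-aware")
    # Assumption: the restore point is expressed in GMT as yyyy:MM:dd HH:mm:ss.
    stamp = restore_to.astimezone(timezone.utc).strftime("%Y:%m:%d %H:%M:%S")
    lines = [
        # Copy each closed commit log segment into the archive directory.
        f"archive_command=/bin/cp %path {archive_dir}/%name",
        # Copy archived segments back into place during a restore.
        "restore_command=/bin/cp -f %from %to",
        f"restore_directories={restore_dir}",
        f"restore_point_in_time={stamp}",
    ]
    return "\n".join(lines) + "\n"
```

In normal operation only archive_command would be set. The restore properties are filled in on the node being restored, after the snapshot and any incremental files are in place, so that replay stops at the chosen point in time.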
Datacenter backup
With this setup, a separate data center acts as the backup, and Cassandra streams data to it as it is added. This mechanism avoids cumbersome snapshot-based backups that require files to be stored on network storage. However, it will not protect against developer mistakes (e.g., deletion of data) unless there is a time buffer between the two data centers.
Backup options comparison