What is the best solution for backing up data in Cassandra?

Solution 1:

Several options are available, and which one is best really depends on your situation, but here's a quick overview of the techniques.

There are basically two things you are trying to protect against when backing up a data store:

  • Data loss due to failure of the underlying storage volumes
  • Data loss due to invalid data being inserted into, or valid data being deleted from, the store

Cassandra has data replication built into its design. The replication factor configuration option tells Cassandra how many copies of each record to store. A common choice for replication factor is 3 because it strikes a decent trade-off between performance and durability. If you have a replication factor greater than one, you can withstand the loss of at least some nodes. There's a lot more to say on this subject, but you can read about that elsewhere.
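To put rough numbers on "at least some nodes", here is a minimal sketch of the arithmetic. It assumes reads and writes use QUORUM consistency, which is my assumption and not something stated above; the function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write to succeed."""
    return replication_factor // 2 + 1


def tolerable_replica_losses(replication_factor: int) -> int:
    """Replicas of a record that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    # Print replication factor, quorum size, and tolerable losses side by side.
    for rf in (1, 2, 3, 5):
        print(rf, quorum(rf), tolerable_replica_losses(rf))
```

With a replication factor of 3 the quorum is 2, so one replica of any given record can be unavailable. With a replication factor of 1 or 2, no replica can be lost under this assumption.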

Replication doesn't protect you against bad data changes through the API, though, since massive deletes will replicate the same as good inserts. To help you with this, Cassandra offers a snapshotting feature. Basically, it hard-links the data files into a snapshots folder within the data directory. This can be a pretty inexpensive approach depending on how frequently and randomly your data changes. One possible approach is to keep multiple snapshots on the disks of the Cassandra machines, provided you have enough space.
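In practice you take a snapshot with nodetool snapshot. To show why hard links make this cheap, here is a small sketch of the technique in Python. It is not Cassandra's own code: the directory layout, the file name and the tag are all hypothetical, and it only handles files at the top level of one directory.

```python
import os
import tempfile


def snapshot_by_hard_link(data_dir: str, tag: str) -> str:
    """Hard-link every regular file in data_dir into data_dir/snapshots/<tag>."""
    snapshot_dir = os.path.join(data_dir, "snapshots", tag)
    os.makedirs(snapshot_dir)
    for name in os.listdir(data_dir):
        source = os.path.join(data_dir, name)
        if os.path.isfile(source):  # skips the snapshots directory itself
            os.link(source, os.path.join(snapshot_dir, name))
    return snapshot_dir


def demo() -> bytes:
    """Snapshot a fake data file, delete the live copy, read the snapshot."""
    with tempfile.TemporaryDirectory() as data_dir:
        live_file = os.path.join(data_dir, "example-Data.db")  # hypothetical name
        with open(live_file, "wb") as handle:
            handle.write(b"immutable sstable contents")
        snapshot_dir = snapshot_by_hard_link(data_dir, "before-cleanup")
        os.remove(live_file)  # simulate the live file going away later
        with open(os.path.join(snapshot_dir, "example-Data.db"), "rb") as handle:
            return handle.read()
```

A hard link is just a second name for the same blocks on disk, so the snapshot uses almost no extra space when it is taken. The space cost shows up later: once the live file is removed, the snapshot's link is the only thing keeping those blocks around, which is why old snapshots get more expensive the more your data churns.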

If you have spare IO capacity, you could actually transfer those backups to separate machines. In my experience, this takes a great toll on both disk and network throughput.

Finally, as of Cassandra 0.7, you can configure multi-datacenter replication. Essentially, you can have multiple copies of your cluster distributed throughout the world. If you combine this with snapshots, you have quite a few different options for restoring your data when something goes wrong.
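For a rough idea of what that configuration looks like, here is a sketch that builds a keyspace definition with per-datacenter replica counts. It assumes a later Cassandra version that speaks CQL and uses NetworkTopologyStrategy (0.7 itself was configured differently). The keyspace name, the datacenter names and the helper function are all made up for illustration.

```python
from typing import Mapping


def create_keyspace_cql(keyspace: str, replicas_per_datacenter: Mapping[str, int]) -> str:
    """Build a CREATE KEYSPACE statement with one replica count per datacenter."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {count}" for dc, count in replicas_per_datacenter.items()]
    replication_map = "{" + ", ".join(options) + "}"
    return f"CREATE KEYSPACE {keyspace} WITH replication = {replication_map};"


if __name__ == "__main__":
    # Hypothetical datacenter names; real ones must match what your snitch reports.
    print(create_keyspace_cql("app_data", {"us_east": 3, "eu_west": 3}))
```

The sketch does no quoting or validation of its inputs, so treat it as a way to see the shape of the statement and not as something to feed untrusted names into.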

Remember that you get 1 point for making a backup and 10,000 points for restoring one. Think about how you will test your backups to make sure they can actually be restored when the time comes.
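One cheap first check is to restore into a scratch directory and compare checksums against the backup. Here is a minimal sketch of that idea; the directory names are hypothetical, and the "restore" in the demo is just a copy standing in for whatever your real restore procedure is.

```python
import hashlib
import os
import shutil
import tempfile


def checksums(root: str) -> dict:
    """Map each file's path relative to root to the SHA-256 of its contents."""
    digests = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def restore_matches_backup(backup_dir: str, restored_dir: str) -> bool:
    """True when both trees hold the same files with identical contents."""
    return checksums(backup_dir) == checksums(restored_dir)


def demo() -> tuple:
    """Check a clean restore, then a deliberately corrupted one."""
    with tempfile.TemporaryDirectory() as tmp:
        backup = os.path.join(tmp, "backup")
        os.makedirs(backup)
        with open(os.path.join(backup, "example-Data.db"), "wb") as handle:
            handle.write(b"good data")
        restored = os.path.join(tmp, "restored")
        shutil.copytree(backup, restored)  # stand-in for a real restore
        clean = restore_matches_backup(backup, restored)
        with open(os.path.join(restored, "example-Data.db"), "wb") as handle:
            handle.write(b"corrupted")
        corrupted = restore_matches_backup(backup, restored)
        return clean, corrupted
```

This only proves the files came back intact. It reads each file fully into memory, which is fine for a sketch but not for large data files, and a real restore test should still load the data into a node and query it.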