Cassandra: understanding replication factor
Hypothetical situation:
- Set up a Cassandra cluster with N nodes.
- Create a keyspace and set replication_factor to 1 and use SimpleStrategy.
- Add some data.
- Remove 1 node.
Does this mean that 1/N of the data is now missing?
For read requests, yes, that's what it means. A replication factor of 1 is generally something you don't want with Cassandra (unless you have a single node).
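To make the 1/N figure concrete, here is a rough Python sketch of a toy token ring with a replication factor of 1. It is not how Cassandra is implemented: the 32-bit ring, the MD5-based `token` function, the evenly spaced node tokens and all function names are assumptions chosen for illustration only.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2**32  # toy token space (assumption); real partitioners use a much larger one


def token(key: str) -> int:
    """Hash a partition key onto the toy ring (MD5 is only a stand-in)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")


def build_ring(n_nodes: int) -> list[int]:
    """One evenly spaced token per node; node i owns the range ending at its token."""
    return [(i + 1) * RING_SIZE // n_nodes - 1 for i in range(n_nodes)]


def owner(ring: list[int], key: str) -> int:
    """With RF=1 the only replica is the first node whose token is >= the key's token."""
    return bisect_left(ring, token(key)) % len(ring)


def unreadable_fraction(n_nodes: int, dead_node: int, n_keys: int = 10_000) -> float:
    """Fraction of keys whose single replica lived on the node that went away."""
    ring = build_ring(n_nodes)
    keys = (f"key-{i}" for i in range(n_keys))
    lost = sum(1 for k in keys if owner(ring, k) == dead_node)
    return lost / n_keys


if __name__ == "__main__":
    # With 4 nodes this should come out close to 1/4.
    print(unreadable_fraction(n_nodes=4, dead_node=2))
```

With evenly distributed tokens, each node holds roughly 1/N of the keys, so losing one node without streaming its data elsewhere makes roughly 1/N of the rows unreadable.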
Higher replication factors give you better resilience, but the main parameter that determines the availability of rows is actually the consistency level (which is set per query).
For write requests, the ANY consistency level would make the cluster accept the request even if the selected target for the row is unavailable (it would use hinted handoff to commit the write later).
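The interaction between replication factor and consistency level can be sketched as a small check. This is a simplified model, not Cassandra's code: it covers only the ANY, ONE, QUORUM and ALL levels, ignores datacenter-aware levels, and the function names are made up for the example.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the request to succeed."""
    levels = {
        "ANY": 0,  # writes only: a hint stored by the coordinator is enough
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def request_succeeds(consistency_level: str, replication_factor: int,
                     live_replicas: int, is_write: bool) -> bool:
    """True if enough replicas of the row are alive for this consistency level."""
    if consistency_level == "ANY" and not is_write:
        raise ValueError("ANY is only valid for writes")
    return live_replicas >= required_replicas(consistency_level, replication_factor)


if __name__ == "__main__":
    # RF=1 and the only replica is down: reads at ONE fail, writes at ANY are accepted.
    print(request_succeeds("ONE", 1, live_replicas=0, is_write=False))  # False
    print(request_succeeds("ANY", 1, live_replicas=0, is_write=True))   # True
    # RF=3 with one replica down: QUORUM still works, ALL does not.
    print(request_succeeds("QUORUM", 3, live_replicas=2, is_write=False))  # True
    print(request_succeeds("ALL", 3, live_replicas=2, is_write=False))     # False
```

Note that a write accepted at ANY is not readable until the hint has been replayed to a real replica.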
You didn't tell us how the node is removed. If you use the nodetool command (nodetool decommission), the data on the node is sent to other nodes before it is removed, so you'll keep your data.
See http://wiki.apache.org/cassandra/Operations#Removing_nodes_entirely
If your node crashes:
- for read requests, the data held by that node is lost
- for write requests:
  - for a short issue like a network failure, your cluster (the coordinator of each request) will hold the data destined for this node until it reappears, using the HintedHandoff feature
  - for a longer or permanent issue, you need to reorganize your cluster so that each node again owns its correct 1/N share of the data, see http://wiki.apache.org/cassandra/Operations#For_versions_1.2.0_and_above