What are the differences between a node, a cluster and a datacenter in a cassandra nosql database?

The hierarchy of elements in Cassandra is:

  • Cluster
    • Data center(s)
      • Rack(s)
        • Server(s)
          • Node (more accurately, a vnode)

A Cluster is a collection of Data Centers.

A Data Center is a collection of Racks.

A Rack is a collection of Servers.

A Server contains 256 virtual nodes (or vnodes) by default.

A vnode is the data storage layer within a server.

Note: A server is the Cassandra software. A server is installed on a machine, where a machine is either a physical server, an EC2 instance, or similar.

Now to specifically address your questions.

An individual unit of data is called a partition. And yes, partitions are replicated across multiple nodes. Each copy of the partition is called a replica.

In a multi-data center cluster, the replication is per data center. For example, if you have a data center in San Francisco named dc-sf and another in New York named dc-ny then you can control the number of replicas per data center.

As an example, you could set dc-sf to have 3 replicas and dc-ny to have 2 replicas.

Those numbers are called the replication factor. You would specifically say dc-sf has a replication factor of 3, and dc-ny has a replication factor of 2. In simple terms, dc-sf would have 3 copies of the data spread across three vnodes, while dc-sf would have 2 copies of the data spread across two vnodes.

While each server has 256 vnodes by default, Cassandra is smart enough to pick vnodes that exist on different physical servers.

To summarize:

  • Data is replicated across multiple virtual nodes (each server contains 256 vnodes by default)
  • Each copy of the data is called a replica
  • The unit of data is called a partition
  • Replication is controlled per data center

A node is a single machine that runs Cassandra. A collection of nodes holding similar data are grouped in what is known as a "ring" or cluster.

Sometimes if you have a lot of data, or if you are serving data in different geographical areas, it makes sense to group the nodes of your cluster into different data centers. A good use case of this, is for an e-commerce website, which may have many frequent customers on the east coast and the west coast. That way your customers on the east coast connect to your east coast DC (for faster performance), but ultimately have access to the same dataset (both DCs are in the same cluster) as the west coast customers.

More information on this can be found here: About Apache Cassandra- How does Cassandra work?

And all the nodes that contains the same (duplicated) data compose a datacenter. Is that right?

Close, but not necessarily. The level of data duplication you have is determined by your replication factor, which is set on a per-keyspace basis. For instance, let's say that I have 3 nodes in my single DC, all storing 600GB of product data. My products keyspace definition might look like this:

CREATE KEYSPACE products
WITH replication = {'class': 'NetworkTopologyStrategy', 'MyDC': '3'};

This will ensure that my product data is replicated equally to all 3 nodes. The size of my total dataset is 600GB, duplicated on all 3 nodes.

But let's say that we're rolling-out a new, fairly large product line, and I estimate that we're going to have another 300GB of data coming, which may start pushing the max capacity of our hard drives. If we can't afford to upgrade all of our hard drives right now, I can alter the replication factor like this:

CREATE KEYSPACE products
WITH replication = {'class': 'NetworkTopologyStrategy', 'MyDC': '2'};

This will create 2 copies of all of our data, and store it in our current cluster of 3 nodes. The size of our dataset is now 900GB, but since there are only two copies of it (each node is essentially responsible for 2/3 of the data) our size on-disk is still 600GB. The drawback here, is that (assuming I read and write at a consistency level of ONE) I can only afford to suffer a loss of 1 node. Whereas with 3 nodes and a RF of 3 (again reading and writing at consistency ONE), I could lose 2 nodes and still serve requests.

Edit 20181128

When I make a network request am I making that against the server? or the node? Or I make a request against the server does it then route it and read from the node or something else?

So real quick explanation: server == node

As far as making a request against the nodes in your cluster, that behavior is actually dictated from the driver on the application side. In fact, the driver maintains a copy of the current network topology, as it reads the cluster gossip similar to how the nodes do.

On the application side, you can set a load balancing policy. Specifically, the TokenAwareLoadBalancingPolicy class will examine the partition key of each request, figure out which node(s) has the data, and send the request directly there.

For the other load balancing policies, or for queries where a single partition key cannot be determined, the request will be sent to a single node. This node will act as a "coordinator." This chosen node will handle the routing of requests to the nodes responsible for them, as well as the compilation/returning of any result sets.


Node:

A machine which stores some portion of your entire database. This may included data replicated from another node as well as it's own data. What data it is responsible for is determined by it's token ranges, and the replication strategy of the keyspace holding the data.

Datacenter:

A logical grouping of Nodes which can be separated from another nodes. A common use case is AWS-EAST vs AWS-WEST. The replication NetworkTopologyStrategy is used to specify how many replicas of the entire keyspace should exist in any given datacenter. This is how Cassandra users achieve cross-dc replication. In addition their are Consistency Level policies that only require acknowledgement only within the Datacenter of the coordinator (LOCAL_*)

Cluster

The sum total of all the machines in your database including all datacenters. There is no cross-cluster replication.