Large-scale data processing: HBase vs Cassandra [closed]

As a Cassandra developer, I'm better at answering the other side of the question:

  • Cassandra scales better. Cassandra is known to scale to over 400 nodes in a cluster; when Facebook deployed Messaging on top of HBase they had to shard it across 100-node HBase sub-clusters.
  • Cassandra supports hundreds, even thousands of ColumnFamilies. "HBase currently does not do well with anything above two or three column families."
  • As a fully distributed system with no "special" nodes or processes, Cassandra is simpler to set up and operate, easier to troubleshoot, and more robust.
  • Cassandra's support for multi-master replication means that not only do you get the obvious power of multiple datacenters -- geographic redundancy, local latencies -- but you can also split realtime and analytical workloads into separate groups, with realtime, bidirectional replication between them. If you don't split those workloads apart they will contend spectacularly.
  • Because each Cassandra node manages its own local storage, Cassandra has a substantial performance advantage that is unlikely to be narrowed significantly. (E.g., it's standard practice to put the Cassandra commitlog on a separate device so it can do its sequential writes unimpeded by random I/O from read requests.)
  • Cassandra lets you choose how strong you want consistency to be on a per-operation basis. Sometimes this is misunderstood as "Cassandra does not give you strong consistency," but that is incorrect. (See the Thrift sketch after this list.)
  • Cassandra offers RandomPartitioner as well as the more Bigtable-like OrderedPartitioner. RandomPartitioner is much less prone to hot spots.
  • Cassandra offers on- or off-heap caching with performance comparable to memcached, but without the cache consistency problems or the complexity of extra moving parts.
  • Non-Java clients are not second-class citizens.
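To make the per-operation consistency point concrete, here is a minimal sketch against the 0.8-era Thrift API; the keyspace, column family, key, and host/port are all placeholders, and the keyspace is assumed to exist already. The consistency level is simply an argument to each call, not a cluster-wide setting:

    import java.nio.ByteBuffer;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class PerOperationConsistency {
        public static void main(String[] args) throws Exception {
            // Framed Thrift connection to a single node (host/port are placeholders).
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("Keyspace1");                     // assumed to exist already

            ByteBuffer key = ByteBuffer.wrap("user42".getBytes("UTF-8"));
            ColumnParent parent = new ColumnParent("Users");      // hypothetical column family

            Column col = new Column();
            col.setName(ByteBuffer.wrap("email".getBytes("UTF-8")));
            col.setValue(ByteBuffer.wrap("a@example.com".getBytes("UTF-8")));
            col.setTimestamp(System.currentTimeMillis() * 1000);  // microseconds by convention

            // QUORUM write: strongly consistent when paired with QUORUM reads...
            client.insert(key, parent, col, ConsistencyLevel.QUORUM);

            // ...while the very next operation can trade consistency for latency.
            client.insert(key, parent, col, ConsistencyLevel.ONE);

            transport.close();
        }
    }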

To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.
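As an illustration of that last point, here is a minimal sketch of a contiguous range scan with the 0.90-era native HBase client; the table name, column family, and key range are made up. Because HBase stores rows sorted by key, a scan over [start, stop) is one sequential read, which is the access pattern Hadoop batch jobs lean on:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "events");            // hypothetical table

            // Scan the contiguous slice of rows whose keys fall in [2011-08-01, 2011-09-01).
            Scan scan = new Scan(Bytes.toBytes("2011-08-01"), Bytes.toBytes("2011-09-01"));
            scan.addFamily(Bytes.toBytes("d"));                   // hypothetical column family

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }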

There are also some things that are not necessarily better or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning and adds SuperColumns instead.

Hope that helps!


Which is best for you really depends on what you are going to use it for; they each have their advantages, and without any more details it becomes more of a religious war. The post you referenced is also more than a year old, and both have gone through many changes since then. Please also keep in mind that I am not familiar with the more recent Cassandra developments.

Having said that, I'll paraphrase HBase committer Andrew Purtell and add some of my own experiences:

  • HBase is run in larger production environments (1,000 nodes), although that is still in the ballpark of Cassandra's ~400-node installs, so it's really a marginal difference.

  • HBase and Cassandra both support replication between clusters/datacenters. I believe HBase exposes more of it to the user, so it appears more complicated, but you also get more flexibility.

  • If strong consistency is what your application needs, then HBase is likely a better fit. It is designed from the ground up to be consistent. For example, it allows for a simpler implementation of atomic counters (I think Cassandra just got them) as well as Check and Put operations (see the native-client sketch after this list).

  • Write performance is great; from what I understand, that was one of the reasons Facebook went with HBase for their messenger.

  • I'm not sure of the current state of Cassandra's ordered partitioner, but in the past it required manual rebalancing. HBase handles that for you if you want. The ordered partitioner is important for Hadoop-style processing.

  • Cassandra and HBase are both complex; Cassandra just hides it better. HBase exposes more of it by using HDFS for its storage, but if you look at the codebase, Cassandra is just as layered. If you compare the Dynamo and Bigtable papers, you can see that Cassandra's theory of operation is actually more complex.

  • HBase has more unit tests FWIW.

  • All Cassandra RPC is Thrift; HBase has Thrift, REST, and native Java clients. The Thrift and REST clients only offer a subset of the total client API, but if you want pure speed, the native Java client is there (the sketch after this list uses it).

  • There are advantages to both peer-to-peer and master-slave. The master-slave setup generally makes it easier to debug and reduces quite a bit of complexity.

  • HBase is not tied to only traditional HDFS; you can change out your underlying storage depending on your needs. MapR looks quite interesting, and I have heard good things, although I have not used it myself.
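To put the counters / Check and Put point and the native-Java-client point together, here is a minimal sketch against the 0.90-era HTable API; the table name, column family, and values are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicOpsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");             // hypothetical table

            byte[] row = Bytes.toBytes("user42");
            byte[] cf  = Bytes.toBytes("info");                   // hypothetical column family

            // Atomic counter: the increment happens server-side, so there is no
            // read-modify-write race on the client.
            long visits = table.incrementColumnValue(row, cf, Bytes.toBytes("visits"), 1L);

            // Check and Put: the Put is applied only if the current value matches the
            // expected one -- a simple single-row compare-and-swap.
            Put put = new Put(row);
            put.add(cf, Bytes.toBytes("status"), Bytes.toBytes("active"));
            boolean applied = table.checkAndPut(row, cf, Bytes.toBytes("status"),
                                                Bytes.toBytes("pending"), put);

            System.out.println("visits=" + visits + ", statusUpdated=" + applied);
            table.close();
        }
    }

Both calls go through the native Java client; the Thrift and REST gateways expose a narrower surface, which is the trade-off mentioned above.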


The reason for using 100-node HBase clusters is not that HBase does not scale to larger sizes. It is because it is easier to do HBase/HDFS software upgrades in a rolling fashion without bringing down your entire service. Another reason is to prevent a single NameNode from being a SPOF for the entire service. Also, HBase is being used for various services (not just FB Messages), and it is prudent to have a cookie-cutter approach to setting up numerous HBase clusters based on a 100-node pod approach. The number 100 is ad hoc; we have not focused on whether 100 is optimal or not.