I am trying to understand ZooKeeper, how it works and what it does. Is there any application which is comparable to ZooKeeper?

If you know, then how would you describe ZooKeeper to a layman?

I have tried apache wiki, zookeeper sourceforge...but I am still not able to relate to it.

I just read thru http://zookeeper.sourceforge.net/index.sf.shtml, so aren't there more services like this? Is it as simple as just replicating a server service?


In a nutshell, ZooKeeper helps you build distributed applications.

How it works

You may describe ZooKeeper as a replicated synchronization service with eventual consistency. It is robust, since the persisted data is distributed between multiple nodes (this set of nodes is called an "ensemble") and one client connects to any of them (i.e., a specific "server"), migrating if one node fails; as long as a strict majority of nodes are working, the ensemble of ZooKeeper nodes is alive. In particular, a master node is dynamically chosen by consensus within the ensemble; if the master node fails, the role of master migrates to another node.

How writes are handled

The master is the authority for writes: in this way writes can be guaranteed to be persisted in-order, i.e., writes are linear. Each time a client writes to the ensemble, a majority of nodes persist the information: these nodes include the server for the client, and obviously the master. This means that each write makes the server up-to-date with the master. It also means, however, that you cannot have concurrent writes.

The guarantee of linear writes is the reason for the fact that ZooKeeper does not perform well for write-dominant workloads. In particular, it should not be used for interchange of large data, such as media. As long as your communication involves shared data, ZooKeeper helps you. When data could be written concurrently, ZooKeeper actually gets in the way, because it imposes a strict ordering of operations even if not strictly necessary from the perspective of the writers. Its ideal use is for coordination, where messages are exchanged between the clients.

How reads are handled

This is where ZooKeeper excels: reads are concurrent since they are served by the specific server that the client connects to. However, this is also the reason for the eventual consistency: the "view" of a client may be outdated, since the master updates the corresponding server with a bounded but undefined delay.

In detail

The replicated database of ZooKeeper comprises a tree of znodes, which are entities roughly representing file system nodes (think of them as directories). Each znode may be enriched by a byte array, which stores data. Also, each znode may have other znodes under it, practically forming an internal directory system.

Sequential znodes

Interestingly, the name of a znode can be sequential, meaning that the name the client provides when creating the znode is only a prefix: the full name is also given by a sequential number chosen by the ensemble. This is useful, for example, for synchronization purposes: if multiple clients want to get a lock on a resource, they can each concurrently create a sequential znode on a location: whoever gets the lowest number is entitled to the lock.

Ephemeral znodes

Also, a znode may be ephemeral: this means that it is destroyed as soon as the client that created it disconnects. This is mainly useful in order to know when a client fails, which may be relevant when the client itself has responsibilities that should be taken by a new client. Taking the example of the lock, as soon as the client having the lock disconnects, the other clients can check whether they are entitled to the lock.

Watches

The example related to client disconnection may be problematic if we needed to periodically poll the state of znodes. Fortunately, ZooKeeper offers an event system where a watch can be set on a znode. These watches may be set to trigger an event if the znode is specifically changed or removed or new children are created under it. This is clearly useful in combination with the sequential and ephemeral options for znodes.

Where and how to use it

A canonical example of Zookeeper usage is distributed-memory computation, where some data is shared between client nodes and must be accessed/updated in a very careful way to account for synchronization.

ZooKeeper offers the library to construct your synchronization primitives, while the ability to run a distributed server avoids the single-point-of-failure issue you have when using a centralized (broker-like) message repository.

ZooKeeper is feature-light, meaning that mechanisms such as leader election, locks, barriers, etc. are not already present, but can be written above the ZooKeeper primitives. If the C/Java API is too unwieldy for your purposes, you should rely on libraries built on ZooKeeper such as cages and especially curator.

Where to read more

Official documentation apart, which is pretty good, I suggest to read Chapter 14 of Hadoop: The Definitive Guide which has ~35 pages explaining essentially what ZooKeeper does, followed by an example of a configuration service.


Zookeeper is one of the best open source server and service that helps to reliably coordinates distributed processes. Zookeeper is a CP system (Refer CAP Theorem) that provides Consistency and Partition tolerance. Replication of Zookeeper state across all the nodes makes it an eventually consistent distributed service.

Moreover, any newly elected leader will update its followers with missing proposals or with a snapshot of the state, if the followers have many proposals missing.

Zookeeper also provides an API that is very easy to use. This blog post, Zookeeper Java API examples, has some examples if you are looking for examples.

So where do we use this? If your distributed service needs a centralized, reliable and consistent configuration management, locks, queues etc, you will find Zookeeper a reliable choice.


I understand the ZooKeeper in general but had problems with the terms "quorum" and "split brain" so maybe I can share my findings with you (I consider myself also a layman).

Let's say we have a ZooKeeper cluster of 5 servers. One of the servers will become the leader and the others will become followers.

  • These 5 servers form a quorum. Quorum simply means "these servers can vote upon who should be the leader".

  • So the voting is based on majority. Majority simply means "more than half" so more than half of the number of servers must agree for a specific server to become the leader.

  • So there is this bad thing that may happen called "split brain". A split brain is simply this, as far as I understand: The cluster of 5 servers splits into two parts, or let's call it "server teams", with maybe one part of 2 and the other of 3 servers. This is really a bad situation as if both "server teams" must execute a specific order how would you decide wich team should be preferred? They might have received different information from the clients. So it is really important to know what "server team" is still relevant and which one can/should be ignored.

  • Majority is also the reason you should use an odd number of servers. If you have 4 servers and a split brain where 2 servers seperate then both "server teams" could say "hey, we want to decide who is the leader!" but how should you decide which 2 servers you should choose? With 5 servers it's simple: The server team with 3 servers has the majority and is allowed to select the new leader.

  • Even if you just have 3 servers and one of them fails the other 2 still form the majority and can agree that one of them will become the new leader.

I realize once you think about it some time and understand the terms it's not so complicated anymore. I hope this also helps anyone in understanding these terms.


Zookeeper is a centralized open-source server for maintaining and managing configuration information, naming conventions and synchronization for distributed cluster environment. Zookeeper helps the distributed systems to reduce their management complexity by providing low latency and high availability. Zookeeper was initially a sub-project for Hadoop but now it's a top level independent project of Apache Software Foundation.

More Information