Why is Kafka pull-based instead of push-based?

Scalability was the major driving factor when we design such systems (pull vs push). Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add large number of consumers without affecting performance and without down time.

Kafka can handle events at 100k+ per second rate coming from producers. Because Kafka consumers pull data from the topic, different consumers can consume the messages at different pace. Kafka also supports different consumption models. You can have one consumer processing the messages at real-time and another consumer processing the messages in batch mode.

The other reason could be that Kafka was designed not only for single consumers like Hadoop. Different consumers can have diverse needs and capabilities.

Pull-based systems have some deficiencies like resources wasting due to polling regularly. Kafka supports a 'long polling' waiting mode until real data comes through to alleviate this drawback.


Refer to the Kafka documentation which details the particular design decision: Push vs pull

Major points that were in favor of pull are:

  1. Pull is better in dealing with diversified consumers (without a broker determining the data transfer rate for all);
  2. Consumers can more effectively control the rate of their individual consumption;
  3. Easier and more optimal batch processing implementation.

The drawback of a pull-based systems (consumers polling for data while there's no data available for them) is alleviated somewhat by a 'long poll' waiting mode until data arrives.