Difference between session.timeout.ms and max.poll.interval.ms for Kafka >= 0.10.1

Solution 1:

Before KIP-62, there is only session.timeout.ms (ie, Kafka 0.10.0 and earlier). max.poll.interval.ms is introduced via KIP-62 (part of Kafka 0.10.1).

KIP-62, decouples heartbeats from calls to poll() via a background heartbeat thread, allowing for a longer processing time (ie, time between two consecutive poll()) than heartbeat interval.

Assume processing a message takes 1 minute. If heartbeat and poll are coupled (ie, before KIP-62), you will need to set session.timeout.ms larger than 1 minute to prevent consumer to time out. However, if a consumer dies, it also takes longer than 1 minute to detect the failed consumer.

KIP-62 decouples polling and heartbeat allowing to send heartbeats between two consecutive polls. Now you have two threads running, the heartbeat thread and the processing thread and thus, KIP-62 introduced a timeout for each. session.timeout.ms is for the heartbeat thread while max.poll.interval.ms is for the processing thread.

Assume, you set session.timeout.ms=30000, thus, the consumer heartbeat thread must sent a heartbeat to the broker before this time expires. On the other hand, if processing of a single message takes 1 minutes, you can set max.poll.interval.ms larger than one minute to give the processing thread more time to process a message.

If the processing thread dies, it takes max.poll.interval.ms to detect this. However, if the whole consumer dies (and a dying processing thread most likely crashes the whole consumer including the heartbeat thread), it takes only session.timeout.ms to detect it.

The idea is, to allow for a quick detection of a failing consumer even if processing itself takes quite long.

Implemenation Detail

The new timeout max.poll.interval.ms is mainly a client side concept: if poll() is not called within max.poll.interval.ms, the heartbeat thread will detect this case and send a leave-group request to the broker. -- max.poll.interval.ms is still relevant for consumer group rebalances: if a rebalance is triggered, consumers have max.poll.interval.ms time to re-join the group by calling poll() client side which triggers a join-group request.