Difference between session.timeout.ms and max.poll.interval.ms for Kafka >= 0.10.1
Solution 1:
Before KIP-62 (i.e., Kafka 0.10.0 and earlier), there was only session.timeout.ms. max.poll.interval.ms was introduced via KIP-62 (part of Kafka 0.10.1).
KIP-62 decouples heartbeats from calls to poll() via a background heartbeat thread, allowing for a longer processing time (i.e., the time between two consecutive poll() calls) than the heartbeat interval.
Assume processing a message takes 1 minute. If heartbeats and polling are coupled (i.e., before KIP-62), you need to set session.timeout.ms larger than 1 minute to prevent the consumer from timing out. However, if a consumer dies, it then also takes longer than 1 minute to detect the failed consumer.
KIP-62 decouples polling and heartbeats, allowing heartbeats to be sent between two consecutive polls. Now there are two threads running, the heartbeat thread and the processing thread, and thus KIP-62 introduced a timeout for each: session.timeout.ms is for the heartbeat thread, while max.poll.interval.ms is for the processing thread.
Assume you set session.timeout.ms=30000; the consumer heartbeat thread must then send a heartbeat to the broker before this time expires. On the other hand, if processing a single message takes 1 minute, you can set max.poll.interval.ms larger than one minute to give the processing thread more time to process a message.
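As a rough illustration of the settings just described, here is a minimal sketch of a consumer configuration written as a plain Python dict keyed by the Kafka property names. The broker address, group id, the 2-minute poll interval, and the helper poll_interval_covers_processing are assumptions made up for this example; it also assumes each poll() hands back only as much work as fits in roughly one minute.

```python
# Sketch only: a config dict using the Kafka property names discussed above.
# The broker address and group id are placeholders, not values from the answer.
PROCESSING_TIME_MS = 60_000  # assumed: processing takes about 1 minute per poll

consumer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "example-group",            # placeholder
    "session.timeout.ms": 30_000,           # heartbeat thread timeout
    "max.poll.interval.ms": 120_000,        # processing thread timeout (assumed 2 min)
}


def poll_interval_covers_processing(config, processing_time_ms):
    """Hypothetical sanity check: the poll interval must exceed the processing time."""
    return config["max.poll.interval.ms"] > processing_time_ms


print(poll_interval_covers_processing(consumer_config, PROCESSING_TIME_MS))
```

A client that accepts dict-style configuration with these property names could be given a dict like this one; the exact constructor depends on the client library you use.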
If the processing thread dies, it takes max.poll.interval.ms to detect this. However, if the whole consumer dies (and a dying processing thread most likely crashes the whole consumer, including the heartbeat thread), it takes only session.timeout.ms to detect it.
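The two detection paths can be summarized in a few lines of code. This is a simplified model of the reasoning above, not Kafka code: the function name and the failure labels are invented for illustration, and the returned values are approximate upper bounds rather than exact timings.

```python
def detection_time_ms(failure, session_timeout_ms, max_poll_interval_ms):
    """Approximate upper bound on how long a failure goes unnoticed.

    The failure labels are hypothetical names for the two cases in the text.
    """
    if failure == "processing_thread_stuck":
        # Heartbeats keep flowing, so only the poll deadline catches this.
        return max_poll_interval_ms
    if failure == "consumer_dead":
        # No more heartbeats, so the session timeout applies.
        return session_timeout_ms
    raise ValueError(f"unknown failure type: {failure}")


# With session.timeout.ms=30000 and an assumed max.poll.interval.ms=120000:
print(detection_time_ms("consumer_dead", 30_000, 120_000))
print(detection_time_ms("processing_thread_stuck", 30_000, 120_000))
```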
The idea is to allow quick detection of a failing consumer even if processing itself takes quite long.
Implementation Detail
The new timeout max.poll.interval.ms is mainly a client-side concept: if poll() is not called within max.poll.interval.ms, the heartbeat thread detects this and sends a leave-group request to the broker. However, max.poll.interval.ms is also relevant for consumer group rebalances: if a rebalance is triggered, consumers have max.poll.interval.ms to rejoin the group by calling poll() on the client side, which triggers a join-group request.
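To make the client-side check concrete, here is a toy model of the decision the heartbeat thread makes. It is not the actual Kafka client implementation: PollDeadline, heartbeat_action, and the injectable clock are all hypothetical names chosen for this sketch.

```python
import time


class PollDeadline:
    """Toy model: tracks when poll() was last called (not real Kafka code)."""

    def __init__(self, max_poll_interval_ms, clock=time.monotonic):
        self._max_interval_s = max_poll_interval_ms / 1000.0
        self._clock = clock  # injectable so the sketch can be tested without sleeping
        self._last_poll = clock()

    def record_poll(self):
        # The processing thread would call this on every poll().
        self._last_poll = self._clock()

    def expired(self):
        return self._clock() - self._last_poll > self._max_interval_s


def heartbeat_action(deadline):
    """What the heartbeat thread would send next in this simplified model."""
    return "leave-group" if deadline.expired() else "heartbeat"
```

In this model the heartbeat thread keeps sending heartbeats as long as the processing thread polls in time, and switches to a leave-group request once the poll deadline has passed.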