How to manually set group.id and commit kafka offsets in spark structured streaming?

tl;dr

It is not possible to commit any offsets back to Kafka from Spark Structured Streaming. Starting with Spark version 3.x you can define the name of the Kafka consumer group, however, this still does not make Spark commit any offsets.


Since Spark 3.0.0

According to the Structured Kafka Integration Guide you can provide the ConsumerGroup as an option kafka.group.id:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("kafka.group.id", "myConsumerGroup")
  .load()

However, Spark will still not commit any offsets back, so you will not be able to "manually" commit offsets to Kafka. This option is meant to support Kafka's Authorization using Role-Based Access Control, for which your ConsumerGroup usually needs to follow naming conventions.

A full example of a Spark 3.x application is discussed and solved here.

Until Spark 2.4.x

The Spark Structured Streaming + Kafka Integration Guide clearly states how it manages Kafka offsets: Spark does not commit any offsets back to Kafka, as it relies on internal offset management for fault tolerance.

The most important Kafka configurations for managing offsets are:

  • group.id: The Kafka source automatically creates a unique group id for each query. According to the code, the group.id will be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
  • auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it (see the sketch after this list).
  • enable.auto.commit: The Kafka source does not commit any offsets.
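
To illustrate the startingOffsets option mentioned above, here is a minimal sketch (topic name and offset values are made up for this example; -2 means earliest, -1 means latest):

val dfFromOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  // start partition 0 at offset 23 and partition 1 at the earliest offset
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2}}""")
  .load()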

Therefore, in Structured Streaming it is currently not possible to define a custom group.id for the Kafka consumer; Structured Streaming manages the offsets internally and does not commit them back to Kafka (not even automatically).

2.4.x in Action

Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:

import org.apache.spark.sql.SparkSession

// create SparkSession
val spark = SparkSession.builder()
  .appName("ListenerTester")
  .master("local[*]")
  .getOrCreate()

// read from Kafka topic
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("failOnDataLoss", "false")
  .load()

// write to Kafka topic and set checkpoint directory for this stream
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "testingKafkaProducerOut")
  .option("checkpointLocation", "/home/.../sparkCheckpoint/")
  .start()

// block until the streaming query terminates
query.awaitTermination()

Offset Management by Spark

Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:

myCheckpointDir/offsets/

{"testingKafkaProducer":{"0":1}}

Here the entry in the checkpoint file confirms that the next offset of partition 0 to be consumed is 1, which implies that the application has already processed offset 0 from partition 0 of the topic testingKafkaProducer.

More on the fault-tolerance-semantics are given in the Spark Documentation.

Offset Management by Kafka

However, as stated in the documentation, the offsets are not committed back to Kafka. This can be verified with the kafka-consumer-groups.sh tool from the Kafka installation:

./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"

TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID      HOST         CLIENT-ID
testingKafkaProducer 0          -               1               -    consumer-1-[...] /127.0.0.1   consumer-1

The current offset for this consumer group is unknown to Kafka (shown as "-") because it has never been committed.

Possible Workaround

Please carefully read the comments below from Spark committer @JungtaekLim about the workaround: "Spark's fault tolerance guarantee is based on the fact Spark has a full control of offset management, and they're voiding the guarantee if they're trying to modify it. (e.g. If they change to commit offset to Kafka, then there's no batch information and if Spark needs to move back to the specific batch "behind" guarantee is no longer valid.)"

What I have seen doing some research on the web is that you could commit offsets in the onQueryProgress callback of a customized Spark StreamingQueryListener. That way, you can have a consumer group that keeps track of the current progress; however, its progress is not necessarily aligned with the actual internal consumer group.
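
As an illustration only (not the code from the linked example), here is a minimal sketch of such a listener. It assumes json4s (bundled with Spark) for parsing the endOffset JSON, a single Kafka source in the query, and a made-up tracking group name "externalTracker"; as per the caveat above, offsets committed this way carry no fault-tolerance guarantee and are for monitoring only:

import java.util.Properties

import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Hypothetical listener that mirrors the query's progress into a separate
// Kafka consumer group so that external tools can observe it.
class KafkaOffsetCommitListener(bootstrapServers: String, groupId: String)
    extends StreamingQueryListener {

  implicit val formats: Formats = DefaultFormats

  // plain consumer used only for committing offsets, never for polling
  private lazy val consumer: KafkaConsumer[Array[Byte], Array[Byte]] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("enable.auto.commit", "false")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { source =>
      // endOffset of the Kafka source is a JSON string such as {"testingKafkaProducer":{"0":1}}
      val endOffsets = parse(source.endOffset).extract[Map[String, Map[String, Long]]]
      val toCommit = for {
        (topic, partitions) <- endOffsets
        (partition, offset) <- partitions
      } yield new TopicPartition(topic, partition.toInt) -> new OffsetAndMetadata(offset)
      consumer.commitSync(toCommit.asJava)
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = consumer.close()
}

The listener would then be registered on the SparkSession before starting the query, for example spark.streams.addListener(new KafkaOffsetCommitListener("localhost:9092", "externalTracker")). The committed offsets would then show up under that group when running kafka-consumer-groups.sh.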

Here are some links you may find helpful:

  • Code Example for Listener

  • Discussion on SO around offset management

  • General description on the StreamingQueryListener