Failed to find data source: Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide"

Hello, I am trying to use PySpark + Kafka. To set up the Kafka cluster, I run these commands:

zookeeper-server-start.sh $KAFKA_HOME/../config/zookeeper.properties

kafka-server-start.sh $KAFKA_HOME/../config/*-0.properties & kafka-server-start.sh $KAFKA_HOME/../config/*-1.properties
  • Spark version: spark-3.2.0-bin-hadoop2.7
  • Kafka version: kafka_2.13-3.0.0
  • pyspark version: 3.2.0
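
As a sanity check (a minimal sketch, assuming kafka-python is installed and that the broker started by the commands above listens on localhost:9092), the cluster can be verified and the test topic created before touching Spark:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
if 'topic' not in admin.list_topics():
    # Create the topic that the streaming query subscribes to later.
    admin.create_topics([NewTopic(name='topic', num_partitions=1, replication_factor=1)])
print(admin.list_topics())   # confirms the broker is reachable and 'topic' exists
admin.close()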

The Python code is:

import os
from pyspark.sql import SparkSession

spark_version = '3.2.0'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:{}'.format(spark_version)

spark = SparkSession \
    .builder \
    .appName("TP3") \
    .getOrCreate()

!spark-submit --class TP3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 TweetCount.ipynb

This returns the following error:

Error: Failed to load class TP3.

And when I execute spark.readStream:

from kafka import KafkaConsumer

consumer = KafkaConsumer('topic')
df_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", 'localhost:9092') \
    .option("subscribe", 'topic') \
    .load()

I get this error:

Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".

How can I execute readStream to read from Kafka with PySpark?

Thanks


Solution 1:

Finally, I solved it by putting the following code at the beginning of the notebook.

import os

# Must run before the SparkSession is created, and the value must end with 'pyspark-shell'.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.2.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 pyspark-shell'
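
With that in place, a minimal end-to-end sketch looks like this (assuming the broker and topic from the question; the console sink is an illustrative choice for testing, not from the original post):

import os
from pyspark.sql import SparkSession

# Set before the SparkSession is created so the Kafka packages are on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--packages '
    'org.apache.spark:spark-streaming-kafka-0-10_2.12:3.2.0,'
    'org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 pyspark-shell')

spark = SparkSession.builder.appName("TP3").getOrCreate()

# Kafka delivers key/value as binary, so cast them to strings.
df_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Print each micro-batch to stdout.
query = df_kafka.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()

Note that only spark-sql-kafka-0-10 is strictly required for Structured Streaming; spark-streaming-kafka-0-10 is for the older DStream API. Equivalently, the packages can be passed with .config("spark.jars.packages", "...") on the builder before getOrCreate().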