Add Jar to standalone pyspark
Updated 2021-01-19
There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and other answers already cover them. I wanted to add an answer for those who specifically want to do this from within a Python script or Jupyter Notebook.
When you create the Spark session you can add a .config() call that pulls in the specific JAR (in my case I wanted the Kafka package loaded):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')\
    .getOrCreate()
Using this line of code I didn't need to do anything else (no ENV vars or conf file changes).
- Note 1: The JAR file is downloaded dynamically; you don't need to download it manually.
- Note 2: Make sure the versions match what you want: in the example above my Spark version is 3.0.1, so the coordinate ends with :3.0.1.
Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. The value should be a comma-separated list of Maven coordinates.
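For example, an entry in spark-defaults.conf could look like the sketch below (the two coordinates are just the example packages already used in this answer):
# $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,com.databricks:spark-csv_2.11:1.2.0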
Package and classpath properties have to be set before the JVM is started, which happens during SparkConf initialization. This means the SparkConf.set method cannot be used here.
An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:
import os
from pyspark import SparkConf, SparkContext

# Must be set before the JVM starts, i.e. before SparkConf/SparkContext are created
SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
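Once the context is up, the package pulled in via --packages can be used like any other data source. A rough usage sketch with the spark-csv package (the file path is a placeholder):
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# 'com.databricks.spark.csv' is the data source name registered by the spark-csv package
df = sqlContext.read.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .load('/path/to/file.csv')  # placeholder path
df.show()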