How to load jar dependenices in IPython Notebook

You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

These property can be also set dynamically in your code before SparkContext / SparkSession and corresponding JVM have been started:

packages = "com.databricks:spark-csv_2.11:1.3.0"

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)

I believe you can also add this as a variable to your spark-defaults.conf file. So something like:

spark.jars.packages    com.databricks:spark-csv_2.10:1.3.0

This will load the spark-csv library into PySpark every time you launch the driver.

Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

from pyspark import SparkContext, SparkConf

This way you are only importing the packages you actually need for your script.

How to load jar dependenices in IPython Notebook

Related

Recent Posts