Importing PySpark in the Python shell
Solution 1:
Assuming one of the following:
- Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it, or
- You have run pip install pyspark
Here is a simple method (if you don't care how it works!):
Use findspark:
- Install findspark from your OS shell, then initialize it in your Python shell:
pip install findspark

import findspark
findspark.init()
- Import the necessary modules:
from pyspark import SparkContext
from pyspark import SparkConf
- Done!
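Putting those steps together, here is a minimal sketch of a full session that uses findspark to locate Spark and then spins up a SparkContext (the app name and local master below are just illustrative choices, assuming a local install):

import findspark
findspark.init()  # locates Spark via SPARK_HOME or the pip-installed pyspark

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("findspark-check").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Quick sanity check that the context actually works.
print(sc.parallelize(range(10)).sum())
sc.stop()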
Solution 2:
If you see an error like:
ImportError: No module named py4j.java_gateway
Please add $SPARK_HOME/python/build to PYTHONPATH:
export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
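If you prefer not to touch your shell profile, a rough equivalent is to append the same directories to sys.path from inside Python before importing pyspark; this sketch assumes SPARK_HOME is already set in the environment:

import os
import sys

spark_home = os.environ["SPARK_HOME"]

# Same directories the export lines above add, just done at runtime.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "build"))

from py4j.java_gateway import JavaGateway  # should import cleanly now
from pyspark import SparkContext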
Solution 3:
It turns out that the pyspark launcher script loads Python and sets up the correct library paths automatically. Take a look at $SPARK_HOME/bin/pyspark:
export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
I added these lines to my .bashrc file and the modules are now found correctly!
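After reloading the shell (source ~/.bashrc), a quick, optional sanity check is to ask Python where it found the module; it should resolve to a path under $SPARK_HOME:

import pyspark
print(pyspark.__file__)  # expect something like .../apache-spark/python/pyspark/__init__.py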
Solution 4:
After exporting the Spark path and the Py4j path, it started to work:
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
So, if you don't want to type these every time you fire up the Python shell, you might want to add them to your .bashrc file.
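One caveat: the py4j version in the zip name (py4j-0.8.2.1 here) changes between Spark releases, so the hard-coded path breaks on upgrade. Here is a small sketch that finds whatever py4j zip ships under python/lib/ at runtime instead (assuming SPARK_HOME is exported and the standard distribution layout):

import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]

# Add the PySpark sources plus the bundled py4j zip, whatever its version.
sys.path.insert(0, os.path.join(spark_home, "python"))
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, zip_path)

from pyspark import SparkContext  # should now import without a py4j error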
Solution 5:
Don't run your .py file as: python filename.py
Instead, use: spark-submit filename.py
Source: https://spark.apache.org/docs/latest/submitting-applications.html
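For completeness, a minimal sketch of what such a file might contain (the file name filename.py and the app name are just placeholders):

# filename.py -- run with: spark-submit filename.py
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("example-app")
    sc = SparkContext(conf=conf)

    # Trivial word count just to prove the context works under spark-submit.
    counts = sc.parallelize(["a", "b", "a"]).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())

    sc.stop()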