Importing PySpark in the Python shell

Solution 1:

Assuming one of the following:

  • Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
  • You have run pip install pyspark

Here is a simple method (if you don't care how it works):

Use findspark

  1. Install findspark from your terminal, then initialize it in your Python shell

    pip install findspark
    
    import findspark
    findspark.init()
    
  2. Import the necessary modules

    from pyspark import SparkContext
    from pyspark import SparkConf
    
  3. Done! A minimal end-to-end sketch follows below.

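For reference, here is a minimal end-to-end sketch of the same steps as a single script (the app name is just an example; SPARK_HOME is assumed to be set, or pyspark installed via pip):

import findspark
findspark.init()  # must run before any pyspark import; locates Spark via SPARK_HOME or the pip package

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("findspark-test").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # quick sanity check: prints 45
sc.stop()
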
Solution 2:

If you see an error like this:

ImportError: No module named py4j.java_gateway

Please add $SPARK_HOME/python/build to PYTHONPATH:

export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
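
After exporting those variables, a quick check from the Python shell confirms that both py4j and pyspark resolve (the paths above are the answerer's; adjust them to your install):

import py4j.java_gateway            # should no longer raise ImportError
from pyspark import SparkContext    # imports py4j under the hood
print(py4j.java_gateway.__file__)   # shows where py4j was picked up from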

Solution 3:

It turns out that the pyspark binary loads Python and automatically sets up the correct library paths. Check out $SPARK_HOME/bin/pyspark:

export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added these lines to my .bashrc file, and the modules are now found correctly!
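
With those exports sourced (e.g. source ~/.bashrc), a plain Python shell should find the module without findspark. A quick check, assuming py4j is also resolvable (see Solution 2 if it is not):

import pyspark
print(pyspark.__file__)  # should point somewhere under $SPARK_HOME/python/pyspark/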

Solution 4:

Exporting the Spark path and the Py4j path made it work:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

So, if you don't want to type these every time you fire up the Python shell, you might want to add them to your .bashrc file.
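
Since the py4j zip file name changes between Spark releases, an alternative is to discover it at runtime instead of hard-coding the version. This is only a sketch and assumes the Homebrew layout used above (Python sources under libexec/):

import glob, os, sys

spark_home = os.environ["SPARK_HOME"]
spark_python = os.path.join(spark_home, "libexec", "python")
py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip"))

sys.path[:0] = [spark_python] + py4j_zips  # prepend, mirroring the exports above
from pyspark import SparkContext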

Solution 5:

Don't run your .py file as python filename.py; instead, use spark-submit filename.py.

Source: https://spark.apache.org/docs/latest/submitting-applications.html
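
For illustration, here is a minimal script (file name and contents are made up) that imports PySpark with no PYTHONPATH tweaks, because spark-submit sets up the Python path for you:

# wordcount.py -- run with: spark-submit wordcount.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-example")
sc = SparkContext(conf=conf)

counts = (sc.parallelize(["a b", "b c", "a a"])
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.collect())
sc.stop()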