Pyspark: Exception: Java gateway process exited before sending the driver its port number
I'm trying to run pyspark on my MacBook Air. When I try starting it up, I get the error:
Exception: Java gateway process exited before sending the driver its port number
when sc = SparkContext() is called at startup. I have tried running the following commands:
./bin/pyspark
./bin/spark-shell
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
to no avail. I have also looked here:
Spark + Python - Java gateway process exited before sending the driver its port number?
but the question has never been answered. Please help! Thanks.
Solution 1:
One possible reason is that JAVA_HOME is not set because Java is not installed.
I encountered the same issue. It says
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:296)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:406)
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/spark/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/opt/spark/python/pyspark/context.py", line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
at sc = pyspark.SparkConf(). I solved it by running:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
which is from https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-get-on-ubuntu-16-04
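As a quick sanity check before creating the SparkContext, here is a minimal sketch (the JAVA_HOME value is a hypothetical example path; point it at wherever your JDK actually lives) that confirms a java binary is reachable, which is exactly what the gateway launcher needs:
# Minimal sketch: confirm Java is reachable before the gateway is launched.
import os
import shutil
import subprocess

os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-oracle")  # example path, adjust to your install
java_bin = shutil.which("java") or os.path.join(os.environ["JAVA_HOME"], "bin", "java")
# `java -version` prints to stderr; if this step fails, the gateway error will follow
print(subprocess.run([java_bin, "-version"], capture_output=True, text=True).stderr)

from pyspark import SparkContext
sc = SparkContext("local[2]", "gateway-check")
print(sc.parallelize(range(5)).collect())  # [0, 1, 2, 3, 4]
sc.stop()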
Solution 2:
This should help you.
One solution is adding pyspark-shell to the shell environment variable PYSPARK_SUBMIT_ARGS:
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
There is a change in python/pyspark/java_gateway.py that requires PYSPARK_SUBMIT_ARGS to include pyspark-shell if a PYSPARK_SUBMIT_ARGS variable is set by the user.
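If you create the SparkContext from a script or notebook rather than from bin/pyspark, the same fix can be applied from Python, as in this sketch; the variable must be set before the first SparkContext is created, because that is when the Java gateway is launched:
# Sketch: set PYSPARK_SUBMIT_ARGS before the JVM gateway starts.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"  # note the trailing pyspark-shell

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.master)  # local[2]
sc.stop()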
Solution 3:
I had this error message running pyspark on Ubuntu and got rid of it by installing the openjdk-8-jdk package:
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))
^^^ error
Install OpenJDK 8:
apt-get install openjdk-8-jdk-headless -qq
On macOS
Same thing on macOS; I typed in a terminal:
$ java -version
No Java runtime present, requesting install.
I was prompted to install Java from Oracle's download site, chose the macOS installer, ran jdk-13.0.2_osx-x64_bin.dmg,
and after that checked that Java was installed:
$ java -version
java version "13.0.2" 2020-01-14
EDIT: To install JDK 8 you need to go to https://www.oracle.com/java/technologies/javase-jdk8-downloads.html (login required).
After that I was able to start a Spark context with pyspark.
Checking if it works
In Python:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# check that it really works by running a job
# example from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
data = range(10000)
distData = sc.parallelize(data)
distData.filter(lambda x: not x&1).take(10)
# Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
Note that you might need to set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON, and they have to be the same Python version as the Python (or IPython) you're using to run pyspark (the driver).
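A minimal sketch of that note, assuming the driver and the workers should simply use the interpreter you are already running:
# Sketch: pin worker and driver to the same Python interpreter.
import os
import sys

os.environ["PYSPARK_PYTHON"] = sys.executable         # Python used by the workers
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable  # Python used by the driver

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.parallelize(range(4)).map(lambda x: x * x).collect())  # [0, 1, 4, 9]
sc.stop()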
Solution 4:
I use macOS. I fixed the problem!
Below is how I fixed it.
JDK 8 seems to work fine. (https://github.com/jupyter/jupyter/issues/248)
So I checked /Library/Java/JavaVirtualMachines; I only had jdk-11.jdk in this path.
I downloaded JDK 8 (following the link), which is:
brew tap caskroom/versions
brew cask install java8
After this, I added
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
to the ~/.bash_profile file. (You should check your jdk1.8 folder name.)
It works now! Hope this helps :)
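If you would rather not hard-code the jdk1.8.0_202 folder name, a sketch like the following sets JAVA_HOME for the current Python session only, using macOS's /usr/libexec/java_home helper (it assumes a JDK 8 is installed):
# Sketch: resolve the JDK 8 home on macOS and export it for this session.
import os
import subprocess

java8_home = subprocess.run(
    ["/usr/libexec/java_home", "-v", "1.8"],
    capture_output=True, text=True, check=True,
).stdout.strip()
os.environ["JAVA_HOME"] = java8_home
print("Using JAVA_HOME =", java8_home)

from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # should now start without the gateway error
sc.stop()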
Solution 5:
I will repost how I solved it here just for future reference.
How I solved my similar problem
Prerequisites:
- Anaconda already installed
- Spark already installed (https://spark.apache.org/downloads.html)
- pyspark already installed (https://anaconda.org/conda-forge/pyspark)
Steps I did (NOTE: set the folder paths according to your system):
- set the following environment variables.
- set SPARK_HOME to 'C:\spark\spark-3.0.1-bin-hadoop2.7'
- set HADOOP_HOME to 'C:\spark\spark-3.0.1-bin-hadoop2.7'
- set PYSPARK_DRIVER_PYTHON to 'jupyter'
- set PYSPARK_DRIVER_PYTHON_OPTS to 'notebook'
- add 'C:\spark\spark-3.0.1-bin-hadoop2.7\bin;' to PATH system variable.
- Move the Java installation folder directly under C:\ (previously Java was installed under Program Files, so I re-installed it directly under C:\),
- so my JAVA_HOME becomes 'C:\java\jdk1.8.0_271'
Now it works!
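For reference, here is a sketch of the same setup done from inside Python (for example at the top of a notebook) instead of the Windows environment-variable dialog; the C:\ paths are the examples from this answer and must be adjusted to your machine:
# Sketch: mirror the Windows environment-variable setup in Python.
import os

os.environ["SPARK_HOME"] = r"C:\spark\spark-3.0.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\spark\spark-3.0.1-bin-hadoop2.7"
os.environ["JAVA_HOME"] = r"C:\java\jdk1.8.0_271"  # path without spaces
os.environ["PATH"] = os.environ["SPARK_HOME"] + r"\bin;" + os.environ["PATH"]
# PYSPARK_DRIVER_PYTHON / PYSPARK_DRIVER_PYTHON_OPTS are only read by the
# bin\pyspark launcher, so keep those set in Windows if you start from a shell.

from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # should start without the gateway error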