Link Spark with IPython Notebook
I have followed some tutorials online, but they do not work with Spark 1.5.1 on OS X El Capitan (10.11).
Basically, I ran these commands to download apache-spark:
brew update
brew install scala
brew install apache-spark
then updated the .bash_profile:
# For IPython notebook and pyspark integration
if which pyspark > /dev/null; then
export SPARK_HOME="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
fi
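(As a quick sanity check of my own, not from any tutorial, something like this in a plain Python shell should confirm the variables are actually exported:)

import os

# Sanity check (my own snippet): confirm that the variables exported in
# .bash_profile are visible to the Python process that will run the notebook.
spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME = %s" % spark_home)
print("PYSPARK_SUBMIT_ARGS = %s" % os.environ.get("PYSPARK_SUBMIT_ARGS"))

if spark_home:
    # The Spark distribution ships its Python bindings under $SPARK_HOME/python
    print("pyspark bindings present: %s"
          % os.path.isdir(os.path.join(spark_home, "python", "pyspark")))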
Then I ran:
ipython profile create pyspark
and created a startup file, ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py, configured as follows:
# Configure the necessary Spark environment
import os
import sys
# Spark home
spark_home = os.environ.get("SPARK_HOME")
# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))
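Note that the check above only appends pyspark-shell when the RELEASE file mentions Spark 1.4; with 1.5.1 that string does not match, so PYSPARK_SUBMIT_ARGS stays "--master local[2]". A variant that also matches 1.5 (an untested sketch on my side) would be:

import os

# Sketch: same idea as the check in the startup file, but also matching
# Spark 1.5, since 1.4+ expects PYSPARK_SUBMIT_ARGS to end with "pyspark-shell".
spark_home = os.environ.get("SPARK_HOME")
spark_release_file = os.path.join(spark_home, "RELEASE")
if os.path.exists(spark_release_file):
    release = open(spark_release_file).read()
    if "Spark 1.4" in release or "Spark 1.5" in release:
        submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
        if "pyspark-shell" not in submit_args:
            submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args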
I then run:
ipython notebook --profile=pyspark
The notebook itself works fine, but sc (the SparkContext) is not recognised.
Has anyone managed to do this with Spark 1.5.1?
EDIT: you can follow this guide to get it working:
https://gist.github.com/tommycarpi/f5a67c66a8f2170e263c
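Alternatively, a minimal way to get a SparkContext by hand in a notebook cell (my own sketch, assuming SPARK_HOME is set as above and the py4j version matches your install):

import os
import sys

# Minimal manual setup (sketch): put Spark's Python bindings on sys.path
# and create the SparkContext yourself instead of relying on shell.py.
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Spark 1.4+ expects PYSPARK_SUBMIT_ARGS to end with "pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

from pyspark import SparkContext
sc = SparkContext(appName="notebook")
print(sc.version)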
I have Jupyter installed, and indeed it is simpler than you think:
- Install Anaconda for OS X.
- Install Jupyter by typing the next line in your terminal:
ilovejobs@mymac:~$ conda install jupyter
- Update Jupyter, just in case:
ilovejobs@mymac:~$ conda update jupyter
- Download Apache Spark and compile it, or download and uncompress Apache Spark 1.5.1 + Hadoop 2.6:
ilovejobs@mymac:~$ cd Downloads
ilovejobs@mymac:~/Downloads$ wget http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
- Create an Apps folder in your home directory, e.g.:
ilovejobs@mymac:~/Downloads$ mkdir ~/Apps
- Move the uncompressed folder spark-1.5.1 to the ~/Apps directory:
ilovejobs@mymac:~/Downloads$ mv spark-1.5.1/ ~/Apps
- Move to the ~/Apps directory and verify that Spark is there:
ilovejobs@mymac:~/Downloads$ cd ~/Apps
ilovejobs@mymac:~/Apps$ ls -l
drwxr-xr-x ?? ilovejobs ilovejobs 4096 ?? ?? ??:?? spark-1.5.1
- Here is the first tricky part. Add the Spark binaries to your $PATH:
ilovejobs@mymac:~/Apps$ cd
ilovejobs@mymac:~$ echo "export PATH=$HOME/Apps/spark-1.5.1/bin:$PATH" >> .profile
- Here is the second tricky part. Add these environment variables as well:
ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON=ipython" >> .profile
ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'" >> .profile
- Source the profile to make these variables available in this terminal:
ilovejobs@mymac:~$ source .profile
- Create a ~/notebooks directory:
ilovejobs@mymac:~$ mkdir notebooks
- Move to ~/notebooks and run pyspark:
ilovejobs@mymac:~$ cd notebooks
ilovejobs@mymac:~/notebooks$ pyspark
Notice that you can add those variables to the .bashrc located in your home directory instead.
Now be happy: you should be able to run Jupyter with a PySpark kernel (it will show up as Python 2, but it will use Spark).
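For a quick check that the notebook kernel really talks to Spark, a cell like this (my own sketch, not part of the steps above) should run a tiny local job:

# `sc` is predefined by the pyspark driver; the sum of 0..99 should print 4950.
rdd = sc.parallelize(range(100))
print(rdd.sum())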
First, make sure you have a Spark environment on your machine.
Then install the Python module findspark via pip:
$ sudo pip install findspark
Then, in the Python shell:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
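As a small usage example (my addition, not part of findspark itself):

# Tiny word count on an in-memory list, then release the context.
lines = sc.parallelize(["spark in jupyter", "spark in ipython"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
sc.stop()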
Now you can do what you want with pyspark in the Python shell (or in IPython).
In my view, this is actually the easiest way to use a Spark kernel in Jupyter.
FYI, you can run Scala, PySpark, SparkR, and SQL with Spark running on top of Jupyter via https://github.com/ibm-et/spark-kernel now. The new interpreters were added (and marked experimental) in pull request https://github.com/ibm-et/spark-kernel/pull/146.
See the language support wiki page for more information.