Link Spark with IPython Notebook
I have followed some tutorials online, but they do not work with Spark 1.5.1 on OS X El Capitan (10.11).
Basically, I ran these commands to download apache-spark:
brew update
brew install scala
brew install apache-spark
then updated the .bash_profile:
# For IPython notebook and pyspark integration
if which pyspark > /dev/null; then
export SPARK_HOME="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
fi
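(As a quick sanity check of my own, not from any tutorial, something like this in a plain Python shell should confirm the variables are actually exported:)

import os

# Sanity check (my own snippet): confirm that the variables exported in
# .bash_profile are visible to the Python process that will run the notebook.
spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME = %s" % spark_home)
print("PYSPARK_SUBMIT_ARGS = %s" % os.environ.get("PYSPARK_SUBMIT_ARGS"))

if spark_home:
    # The Spark distribution ships its Python bindings under $SPARK_HOME/python
    print("pyspark bindings present: %s"
          % os.path.isdir(os.path.join(spark_home, "python", "pyspark")))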
Then I ran:
ipython profile create pyspark
and created a startup file, ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py, configured as follows:
# Configure the necessary Spark environment
import os
import sys
# Spark home
spark_home = os.environ.get("SPARK_HOME")
# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))
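Note that the check above only appends pyspark-shell when the RELEASE file mentions Spark 1.4; with 1.5.1 that string does not match, so PYSPARK_SUBMIT_ARGS stays "--master local[2]". A variant that also matches 1.5 (an untested sketch on my side) would be:

import os

# Sketch: same idea as the check in the startup file, but also matching
# Spark 1.5, since 1.4+ expects PYSPARK_SUBMIT_ARGS to end with "pyspark-shell".
spark_home = os.environ.get("SPARK_HOME")
spark_release_file = os.path.join(spark_home, "RELEASE")
if os.path.exists(spark_release_file):
    release = open(spark_release_file).read()
    if "Spark 1.4" in release or "Spark 1.5" in release:
        submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
        if "pyspark-shell" not in submit_args:
            submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args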
I then run:
ipython notebook --profile=pyspark
The notebook itself works fine, but sc (the SparkContext) is not recognised.
Has anyone managed to do this with Spark 1.5.1?
EDIT: you can follow this guide to get it working:
https://gist.github.com/tommycarpi/f5a67c66a8f2170e263c
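Alternatively, a minimal way to get a SparkContext by hand in a notebook cell (my own sketch, assuming SPARK_HOME is set as above and the py4j version matches your install):

import os
import sys

# Minimal manual setup (sketch): put Spark's Python bindings on sys.path
# and create the SparkContext yourself instead of relying on shell.py.
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Spark 1.4+ expects PYSPARK_SUBMIT_ARGS to end with "pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

from pyspark import SparkContext
sc = SparkContext(appName="notebook")
print(sc.version)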
I have Jupyter installed, and indeed it is simpler than you think:
- Install Anaconda for OS X.
- Install Jupyter by typing the next line in your terminal:
ilovejobs@mymac:~$ conda install jupyter
- Update Jupyter, just in case:
ilovejobs@mymac:~$ conda update jupyter
- Download Apache Spark and compile it, or download and uncompress Apache Spark 1.5.1 + Hadoop 2.6:
ilovejobs@mymac:~$ cd Downloads
ilovejobs@mymac:~/Downloads$ wget http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
- Create an Apps folder in your home directory, e.g.:
ilovejobs@mymac:~/Downloads$ mkdir ~/Apps
- Move the uncompressed folder spark-1.5.1 to the ~/Apps directory:
ilovejobs@mymac:~/Downloads$ mv spark-1.5.1/ ~/Apps
- Move to the ~/Apps directory and verify that Spark is there:
ilovejobs@mymac:~/Downloads$ cd ~/Apps
ilovejobs@mymac:~/Apps$ ls -l
drwxr-xr-x ?? ilovejobs ilovejobs 4096 ?? ?? ??:?? spark-1.5.1
- Here is the first tricky part. Add the Spark binaries to your $PATH:
ilovejobs@mymac:~/Apps$ cd
ilovejobs@mymac:~$ echo "export PATH=$HOME/Apps/spark-1.5.1/bin:$PATH" >> .profile
- Here is the second tricky part. Add these environment variables as well:
ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON=ipython" >> .profile
ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'" >> .profile
- Source the profile to make these variables available in this terminal:
ilovejobs@mymac:~$ source .profile
- Create a ~/notebooks directory:
ilovejobs@mymac:~$ mkdir notebooks
- Move to ~/notebooks and run pyspark:
ilovejobs@mymac:~$ cd notebooks
ilovejobs@mymac:~/notebooks$ pyspark
Notice that you can add those variables to the .bashrc located in your home directory instead.
Now be happy: you should be able to run Jupyter with a PySpark kernel (it will show up as Python 2, but it will use Spark).
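For a quick check that the notebook kernel really talks to Spark, a cell like this (my own sketch, not part of the steps above) should run a tiny local job:

# `sc` is predefined by the pyspark driver; the sum of 0..99 should print 4950.
rdd = sc.parallelize(range(100))
print(rdd.sum())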
First, make sure you have a Spark environment on your machine.
Then install the Python module findspark via pip:
$ sudo pip install findspark
Then, in the Python shell:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
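As a small usage example (my addition, not part of findspark itself):

# Tiny word count on an in-memory list, then release the context.
lines = sc.parallelize(["spark in jupyter", "spark in ipython"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
sc.stop()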
Now you can do what you want with pyspark in the Python shell (or in IPython).
In my view, this is actually the easiest way to use a Spark kernel in Jupyter.
FYI, you can run Scala, PySpark, SparkR, and SQL with Spark running on top of Jupyter via https://github.com/ibm-et/spark-kernel now. The new interpreters were added (and marked experimental) in pull request https://github.com/ibm-et/spark-kernel/pull/146.
See the language support wiki page for more information.