How can I run Spark on a cluster using Slurm?
I have written a program, example.jar, which uses a Spark context. How can I run it on a cluster that uses Slurm? This is related to https://stackoverflow.com/questions/29308202/running-spark-on-top-of-slurm but the answers there are not very detailed and are not on Server Fault.
To run an application that uses a Spark context, you first have to run a Slurm job that starts a master and some workers. There are a few things to watch out for when using Slurm:
- don't start Spark as a daemon
- make the Spark workers use only as many cores and as much memory as were requested for the Slurm job
- in order to run master and workers in the same job, you will have to branch somewhere in your script
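The second point can be tried out in isolation. This is only a minimal sketch: worker_memory is a name I made up, the fallback values (4 CPUs, 500 MB per CPU) are assumptions so that it also runs outside of Slurm, and the factor of one half simply mirrors what the script below does.

```shell
#!/bin/bash
# Hypothetical helper (not part of Spark or Slurm): memory limit for one
# worker in Spark's "<n>m" notation, using half of the task's allocation
# (the same factor the full script uses).
worker_memory() {
    local mem_per_cpu_mb=$1 cpus_per_task=$2
    echo "$(( mem_per_cpu_mb * cpus_per_task / 2 ))m"
}

# Inside a job Slurm sets these variables; the fallbacks are assumed example
# values so that the sketch also runs outside of Slurm.
cpus=${SLURM_CPUS_PER_TASK:-4}
mem_per_cpu=${SLURM_MEM_PER_CPU:-500}

export SPARK_WORKER_CORES=$cpus
export SPARK_DAEMON_MEMORY=$( worker_memory "$mem_per_cpu" "$cpus" )
echo "cores=$SPARK_WORKER_CORES memory=$SPARK_DAEMON_MEMORY"
```

Note that SLURM_MEM_PER_CPU is only set when the job was submitted with --mem-per-cpu, so a job that requests memory in another way would need a different source for that number.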
I'm working with the Linux binaries installed to $HOME/spark-1.5.2-bin-hadoop2.6/. Remember to replace <username> and <shared folder> with valid values in the script.
#!/bin/bash
#start_spark_slurm.sh
#SBATCH --nodes=3
# ntasks-per-node MUST be one, because multiple workers per node don't
# work well with Slurm + Spark in this script (they would need increasing
# port numbers, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
# Beware! $HOME will not be expanded, and invalid paths will result in Slurm
# jobs hanging indefinitely with status CG (completing) when calling scancel!
#SBATCH --output="/home/<username>/spark/logs/%j.out"
#SBATCH --error="/home/<username>/spark/logs/%j.err"
#SBATCH --time=01:00:00
# This section will be run when started by sbatch
if [ "$1" != 'srunning' ]; then
    this=$0
    # I experienced problems with some nodes not finding the script:
    #   slurmstepd: execve(): /var/spool/slurm/job123/slurm_script:
    #   No such file or directory
    # That's why this script is copied to a shared location that
    # all nodes can access:
    script="/<shared folder>/${SLURM_JOBID}_$( basename -- "$0" )"
    cp "$this" "$script"
    # This might not be necessary on all clusters
    module load scala/2.10.4 java/jdk1.7.0_25 cuda/7.0.28
    export sparkLogs=$HOME/spark/logs
    export sparkTmp=$HOME/spark/tmp
    mkdir -p -- "$sparkLogs" "$sparkTmp"
    export SPARK_ROOT=$HOME/spark-1.5.2-bin-hadoop2.6/
    export SPARK_WORKER_DIR=$sparkLogs
    export SPARK_LOCAL_DIRS=$sparkLogs
    export SPARK_MASTER_PORT=7077
    export SPARK_MASTER_WEBUI_PORT=8080
    export SPARK_WORKER_CORES=$SLURM_CPUS_PER_TASK
    export SPARK_DAEMON_MEMORY=$(( $SLURM_MEM_PER_CPU * $SLURM_CPUS_PER_TASK / 2 ))m
    export SPARK_MEM=$SPARK_DAEMON_MEMORY
    srun "$script" 'srunning'
# If run by srun, then decide by $SLURM_PROCID whether we are master or worker
else
    source "$SPARK_ROOT/sbin/spark-config.sh"
    source "$SPARK_PREFIX/bin/load-spark-env.sh"
    if [ "$SLURM_PROCID" -eq 0 ]; then
        export SPARK_MASTER_IP=$( hostname )
        MASTER_NODE=$( scontrol show hostname "$SLURM_NODELIST" | head -n 1 )
        # The saved master address + port is needed later for submitting jobs
        echo "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT" > "$sparkLogs/${SLURM_JOBID}_spark_master"
        "$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.master.Master \
            --ip "$SPARK_MASTER_IP" \
            --port "$SPARK_MASTER_PORT" \
            --webui-port "$SPARK_MASTER_WEBUI_PORT"
    else
        # $(scontrol show hostname) is used to expand e.g. host20[39-40]
        # into host2039 and host2040; the first entry is taken as the master.
        # This step assumes that SLURM_PROCID=0 corresponds to the first
        # node in SLURM_NODELIST!
        MASTER_NODE=spark://$( scontrol show hostname "$SLURM_NODELIST" | head -n 1 ):$SPARK_MASTER_PORT
        "$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.worker.Worker "$MASTER_NODE"
    fi
fi
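To see what the worker branch does with the node list, here is a small sketch that runs without Slurm. expand_simple_nodelist and master_url are names I made up; the first is only a rough stand-in for scontrol show hostname and handles just a single prefix[a-b] range or a plain hostname, which real Slurm node lists are not limited to.

```shell
#!/bin/bash
# Hypothetical stand-in for "scontrol show hostname": expands a node list of
# the simple form prefix[a-b] (one range, no commas) into one hostname per
# line. Anything else is printed unchanged.
expand_simple_nodelist() {
    local nodelist=$1
    local re='^([^][]+)\[([0-9]+)-([0-9]+)\]$'
    if [[ $nodelist =~ $re ]]; then
        local prefix=${BASH_REMATCH[1]}
        local first=${BASH_REMATCH[2]} last=${BASH_REMATCH[3]}
        local width=${#first} i
        # 10# forces base 10 so that zero-padded numbers are not read as octal
        for (( i = 10#$first; i <= 10#$last; i++ )); do
            printf '%s%0*d\n' "$prefix" "$width" "$i"
        done
    else
        printf '%s\n' "$nodelist"
    fi
}

# Hypothetical helper: builds the master URL from the first node in the list,
# which is the node the script above assumes to run SLURM_PROCID=0.
master_url() {
    local nodelist=$1 port=${2:-7077}
    echo "spark://$( expand_simple_nodelist "$nodelist" | head -n 1 ):$port"
}

master_url 'host20[39-40]'
```

On a real cluster you would keep using scontrol, since it understands the full node list syntax.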
Now start the sbatch job and, after that, example.jar:
mkdir -p -- "$HOME/spark/logs"
jobid=$( sbatch ./start_spark_slurm.sh )
jobid=${jobid##Submitted batch job }
MASTER_WEB_UI=''
while [ -z "$MASTER_WEB_UI" ]; do
    sleep 1s
    if [ -f "$HOME/spark/logs/$jobid.err" ]; then
        MASTER_WEB_UI=$( sed -n -r 's|.*Started MasterWebUI at (http://[0-9.:]*)|\1|p' "$HOME/spark/logs/$jobid.err" )
    fi
done
MASTER_ADDRESS=$( cat -- "$HOME/spark/logs/${jobid}_spark_master" )
"$HOME/spark-1.5.2-bin-hadoop2.6/bin/spark-submit" --master "$MASTER_ADDRESS" example.jar
firefox "$MASTER_WEB_UI"
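The loop above waits forever if the job never starts or the master fails. A possible variation with a timeout, as a sketch only: parse_jobid and wait_for_file are hypothetical helper names, and the 120 seconds in the usage comment are an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: strips the "Submitted batch job " prefix that sbatch
# prints by default, leaving only the job ID.
parse_jobid() {
    local sbatch_output=$1
    echo "${sbatch_output##Submitted batch job }"
}

# Hypothetical helper: waits until a file exists and is non-empty, or gives
# up after the given number of seconds (default 60).
# Returns 0 on success and 1 on timeout.
wait_for_file() {
    local file=$1 timeout=${2:-60} waited=0
    until [ -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
}

# Usage sketch (commented out because it needs a Slurm cluster):
# jobid=$( parse_jobid "$( sbatch ./start_spark_slurm.sh )" )
# wait_for_file "$HOME/spark/logs/${jobid}_spark_master" 120 || exit 1
# MASTER_ADDRESS=$( cat -- "$HOME/spark/logs/${jobid}_spark_master" )
```

sbatch also has a --parsable option that prints just the job ID (followed by a semicolon and the cluster name on multi-cluster setups), which would make the prefix stripping unnecessary.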
As maxmlnkn's answer states, you need a mechanism to set up and launch the appropriate Spark daemons in a Slurm allocation before a Spark jar can be executed via spark-submit.
Several scripts/systems that do this setup for you have been developed. The answer you linked above mentions Magpie (https://github.com/LLNL/magpie; full disclosure: I'm the developer/maintainer of those scripts). Magpie provides a job submission file (submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark) that you edit to add your cluster specifics and the job scripts to execute. Once it is configured, you submit it via 'sbatch -k ./magpie.sbatch-srun-spark'. See doc/README.spark for more details.
There are other scripts/systems that do this for you as well. I lack experience with them, so I can't comment beyond linking them below:
https://github.com/glennklockwood/myhadoop
https://github.com/hpcugent/hanythingondemand