Best Practice to launch Spark Applications via Web Application?

Very basic answer:

Basically, you can use the SparkLauncher class to launch Spark applications and add some listeners to watch their progress.

However, you may be interested in the Livy server, which is a RESTful server for Spark jobs. As far as I know, Zeppelin uses Livy to submit jobs and retrieve their status.

You can also use the Spark REST interface to check state; the information will then be more precise. Here is an example of how to submit a job via the REST API.

You've got three options; the answer is: check for yourself ;) It depends heavily on your project and requirements. The two main options:

  • SparkLauncher + Spark REST interface
  • Livy server

should both work for you; just check which is easier and a better fit for your project.

Extended answer

You can use Spark from your application in different ways, depending on what you need and what you prefer.

SparkLauncher

SparkLauncher is a class from the spark-launcher artifact. It is used to launch already-prepared Spark jobs, just like spark-submit does.

Typical usage is:

1) Build the project with your Spark job and copy the JAR file to all nodes
2) From your client application, e.g. a web application, create a SparkLauncher that points to the prepared JAR file

SparkAppHandle handle = new SparkLauncher()
    .setSparkHome(SPARK_HOME)
    .setJavaHome(JAVA_HOME)
    .setAppResource(pathToJARFile)
    .setMainClass(MainClassFromJarWithJob)
    .setMaster("MasterAddress")
    .startApplication();
    // or: .launch().waitFor()

startApplication creates a SparkAppHandle, which allows you to add listeners and stop the application. It also provides the possibility to call getAppId.
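For example, a minimal sketch of watching progress with a listener could look like this (it reuses the handle from the snippet above; SparkAppHandle.Listener and its two callbacks come from the same spark-launcher artifact):

import org.apache.spark.launcher.SparkAppHandle;

handle.addListener(new SparkAppHandle.Listener() {
    @Override
    public void stateChanged(SparkAppHandle h) {
        // Fired on every state transition: CONNECTED, SUBMITTED, RUNNING, FINISHED, FAILED, ...
        System.out.println("State changed to: " + h.getState());
    }

    @Override
    public void infoChanged(SparkAppHandle h) {
        // Fired when application info changes, e.g. once the app id becomes available
        System.out.println("Application id: " + h.getAppId());
    }
});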

SparkLauncher should be used together with the Spark REST API. You can query http://driverNode:4040/api/v1/applications/*ResultFromGetAppId*/jobs and you will get information about the current status of the application.
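A rough sketch of such a query from Java could look like this (plain HttpURLConnection; the helper name and the driverNode host are assumptions, 4040 is the default UI port, and appId comes from handle.getAppId()):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.stream.Collectors;

// Hypothetical helper: fetches the JSON job list for a given application id
// from the driver's status API.
public static String fetchJobsJson(String driverNode, String appId) throws Exception {
    URL url = new URL("http://" + driverNode + ":4040/api/v1/applications/" + appId + "/jobs");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(connection.getInputStream()))) {
        // The response is a JSON array with one entry per job (status, number of tasks, ...)
        return reader.lines().collect(Collectors.joining("\n"));
    } finally {
        connection.disconnect();
    }
}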

Spark REST API

It is also possible to submit Spark jobs directly via a RESTful API. Usage is very similar to SparkLauncher, but it's done in a purely RESTful way.

Example request (credits to this article):

curl -X POST http://spark-master-host:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "hdfs:///filepath/spark-job-1.0.jar",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "spark.ExampleJobInPreparedJar",
  "sparkProperties" : {
    "spark.jars" : "hdfs:///filepath/spark-job-1.0.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "ExampleJobInPreparedJar",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:6066"
  }
}'

This command will submit the job in the ExampleJobInPreparedJar class to a cluster with the given Spark master. The response will contain a submissionId field, which is helpful for checking the status of the application; simply call another service: curl http://spark-cluster-ip:6066/v1/submissions/status/submissionIdFromResponse. That's it, nothing more to code.

Livy REST Server and Spark Job Server

Livy REST Server and Spark Job Server are RESTful applications which allow you to submit jobs via a RESTful web service. One major difference between those two and Spark's REST interface is that Livy and SJS don't require jobs to be prepared earlier and packed into a JAR file. You just submit code which will be executed in Spark.

Usage is very simple. The code is taken from the Livy repository, but with some cuts to improve readability.

1) Case 1: submitting a job whose JAR is on the local machine

// creating client
LivyClient client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build();

try {
  // sending and submitting JAR file
  client.uploadJar(new File(piJar)).get();
  // PiJob is a class that implements Livy's Job
  double pi = client.submit(new PiJob(samples)).get();
} finally {
  client.stop(true);
}
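For completeness, here is a sketch of what the PiJob used above might look like, roughly along the lines of the Pi example in Livy's documentation (package names assume the current org.apache.livy API; ctx.sc() exposes the JavaSparkContext that Livy manages):

import java.util.ArrayList;
import java.util.List;

import org.apache.livy.Job;
import org.apache.livy.JobContext;

// Runs inside the Spark context managed by Livy and returns an estimate of Pi.
public class PiJob implements Job<Double> {

    private final int samples;

    public PiJob(int samples) {
        this.samples = samples;
    }

    @Override
    public Double call(JobContext ctx) throws Exception {
        List<Integer> sampleList = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            sampleList.add(i);
        }
        // Count how many random points fall inside the unit circle
        long inside = ctx.sc().parallelize(sampleList)
                .filter(i -> {
                    double x = Math.random();
                    double y = Math.random();
                    return x * x + y * y < 1;
                })
                .count();
        return 4.0 * inside / samples;
    }
}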

2) Case 2: dynamic job creation and execution

# Example in Python; 'code' contains Scala that will be executed in Spark.
# Assumes an interactive Livy session has already been created (POST to /sessions);
# statements_url points to that session's statements endpoint.
import json
import pprint
import textwrap

import requests

livy_url = 'http://livy-server:8998'                   # Livy's default port
statements_url = livy_url + '/sessions/0/statements'   # e.g. session 0
headers = {'Content-Type': 'application/json'}

data = {
  'code': textwrap.dedent("""\
    val NUM_SAMPLES = 100000;
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random();
      val y = Math.random();
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _);
    println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
    """)
}

r = requests.post(statements_url, data=json.dumps(data), headers=headers)
pprint.pprint(r.json()) 

As you can see, both pre-compiled jobs and ad-hoc queries to Spark are possible.

Hydrosphere Mist

Another Spark as a Service application. Mist is very simple and similar to Livy and Spark Job Server.

Usage is very similar:

1) Create job file:

import io.hydrosphere.mist.MistJob

object MyCoolMistJob extends MistJob {
    def doStuff(parameters: Map[String, Any]): Map[String, Any] = {
        val rdd = context.parallelize()
        ...
        result.asInstanceOf[Map[String, Any]]
    }
} 

2) Package the job file into a JAR
3) Send a request to Mist:

curl --header "Content-Type: application/json" -X POST http://mist_http_host:mist_http_port/jobs --data '{"path": "/path_to_jar/mist_examples.jar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}'

One strong point that I can see in Mist is that it has out-of-the-box support for streaming jobs via MQTT.

Apache Toree

Apache Toree was created to enable easy interactive analytics for Spark. It doesn't require any JAR to be built. It works via the IPython protocol, but not only Python is supported.

Currently the documentation focuses on Jupyter notebook support, but there is also a REST-style API.

Comparison and conclusions

I've listed a few options:

  1. SparkLauncher
  2. Spark REST API
  3. Livy REST Server and Spark Job Server
  4. Hydrosphere Mist
  5. Apache Toree

All of them are good for different use cases. I can distinguish a few categories:

  1. Tools that require JAR files with the job: SparkLauncher, Spark REST API
  2. Tools for interactive and pre-packaged jobs: Livy, SJS, Mist
  3. Tools that focus on interactive analytics: Toree (however, there may be some support for pre-packaged jobs; no documentation has been published at this moment)

SparkLauncher is very simple and is part of the Spark project. You write the job configuration in plain code, so it can be easier to build than JSON objects.

For fully RESTful-style submitting, consider the Spark REST API, Livy, SJS and Mist. Three of them are stable projects with some production use cases. The Spark REST API also requires jobs to be pre-packaged, while Livy and SJS don't. However, remember that the Spark REST API is included by default in every Spark distribution, whereas Livy/SJS are not. I don't know much about Mist, but in time it should become a very good tool for integrating all types of Spark jobs.

Toree focuses on interactive jobs. It's still in incubation, but even now you can check out its capabilities.

Why use a custom, additional REST service when there is a built-in REST API? A Spark-as-a-Service tool like Livy is a single entry point to Spark. It manages the Spark context and sits on just one node, which can be outside the cluster. These tools also enable interactive analytics. Apache Zeppelin uses Livy to submit users' code to Spark.


Here is an example of the SparkLauncher T.Gawęda mentioned:

SparkAppHandle handle = new SparkLauncher()
    .setSparkHome(SPARK_HOME)
    .setJavaHome(JAVA_HOME)
    .setAppResource(SPARK_JOB_JAR_PATH)
    .setMainClass(SPARK_JOB_MAIN_CLASS)
    .addAppArgs("arg1", "arg2")
    .setMaster("yarn-cluster")
    .setConf("spark.dynamicAllocation.enabled", "true")
    .startApplication();

Here you can find an example of a Java web application with a Spark job bundled together in a single project. Through SparkLauncher you can get a SparkAppHandle, which you can use to get info about the job status. If you need progress status, you can use the Spark REST API:

http://driverHost:4040/api/v1/applications/[app-id]/jobs

The only dependency you will need for SparkLauncher:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-launcher_2.10</artifactId>
    <version>2.0.1</version>
</dependency>