What does setMaster `local[*]` mean in spark?

Solution 1:

From the doc:

./bin/spark-shell --master local[2]

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.

And from here:

local[*] Run Spark locally with as many worker threads as logical cores on your machine.

Solution 2:

Master URL Meaning


local : Run Spark locally with one worker thread (i.e. no parallelism at all).


local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).


local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)


local[*] : Run Spark locally with as many worker threads as logical cores on your machine.


local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.


spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.


spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.


mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever you have configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.


yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

https://spark.apache.org/docs/latest/submitting-applications.html

Solution 3:

Some additional Info

Do not run Spark Streaming programs locally with master configured as "local" or "local[ 1]". This allocates only one CPU for tasks and if a receiver is running on it, there is no resource left to process the received data. Use at least "local[ 2]" to have more cores.

From -Learning Spark: Lightning-Fast Big Data Analysis

Solution 4:

Master URL

You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

The URL says how many threads can be used in total:

local uses 1 thread only.

local[n] uses n threads.

local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).

local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.

Solution 5:

You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

The URL says how many threads can be used in total:-

local uses 1 thread only.

local[n] uses n threads.

local[*] uses as many threads as your spark local machine have, where you are running your application.

you can check by lscpu in your Linux machine

[ie@mapr2 ~]$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 56 On-line CPU(s) list: 0-55 Thread(s) per core: 2

if your machine has 56 cores means CPU then your spark jobs will be partitioned in 56 part.

NOTE:- there may be the case that in your spark cluster the spark-defaults.conf file has limited the partition value with the default value (like 10 or else) then your partitioned will be the same as default value has been set in config.

local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.