How to deal with executor memory and driver memory in Spark?

I am confused about dealing with executor memory and driver memory in Spark.

My environment settings are as below:

Memory 128 G, 16 CPU for 9 VM
Centos
Hadoop 2.5.0-cdh5.2.0
Spark 1.1.0

Input data information:

3.5 GB data file from HDFS

For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 G memory) with spark-submit. Now I would like to set executor memory or driver memory for performance tuning.

From the Spark documentation, the definition for executor memory is

Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

How about driver memory?

The memory you need to assign to the driver depends on the job.

If the job is based purely on transformations and terminates on some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ... then the memory needs of the driver will be very low. Few 100's of MB will do. The driver is also responsible of delivering files and collecting metrics, but not be involved in data processing.

If the job requires the driver to participate in the computation, like e.g. some ML algo that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent of the amount of data passing through the driver. Operations like .collect,.take and takeSample deliver data to the driver and hence, the driver needs enough memory to allocate such data.

e.g. If you have an rdd of 3GB in the cluster and call val myresultArray = rdd.collect, then you will need 3GB of memory in the driver to hold that data plus some extra room for the functions mentioned in the first paragraph.

In a Spark Application, Driver is responsible for task scheduling and Executor is responsible for executing the concrete tasks in your job.

If you are familiar with MapReduce, your map tasks & reduce tasks are all executed in Executor(in Spark, they are called ShuffleMapTasks & ResultTasks), and also, whatever RDD you want to cache is also in executor's JVM's heap & disk.

So I think a few GBs will just be OK for your Driver.

How to deal with executor memory and driver memory in Spark?

Related

Recent Posts