Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

Add the following JVM arg when you launch spark-shell or spark-submit:

-Dspark.executor.memory=6g

You may also consider to explicitly set the number of workers when you create an instance of SparkContext:

Distributed Cluster

Set the slave names in the conf/slaves:

val sc = new SparkContext("master", "MyApp")

In the documentation (http://spark.apache.org/docs/latest/running-on-yarn.html) you can read how to configure the executors and the memory limit. For example:

--master yarn-cluster --num-executors 10 --executor-cores 3 --executor-memory 4g --driver-memory 5g  --conf spark.yarn.executor.memoryOverhead=409

The memoryOverhead should be the 10% of the executor memory.

Edit: Fixed 4096 to 409 (Comment below refers to this)


Adjusting the memory is probably a good way to go, as has already been suggested, because this is an expensive operation that scales in an ugly way. But maybe some code changes will help.

You could take a different approach in your combine function that avoids if statements by using the combinations function. I'd also convert the second element of the tuples to doubles before the combination operation:

tuples.

    // Convert to doubles only once
    map{ x=>
        (x._1, x._2.toDouble)
    }.

    // Take all pairwise combinations. Though this function
    // will not give self-pairs, which it looks like you might need
    combinations(2).

    // Your operation
    map{ x=>
        (toKey(x{0}._1, x{1}._1), x{0}._2*x{1}._2)
    }

This will give an iterator, which you can use downstream or, if you want, convert to list (or something) with toList.


I had the same issue during long regression fit. I cached the train and test set. It solved my problem.

train_df, test_df = df3.randomSplit([0.8, 0.2], seed=142)
pipeline_model = pipeline_object.fit(train_df)

pipeline_model line was giving java.lang.OutOfMemoryError: GC overhead limit exceeded But when I used

train_df, test_df = df3.randomSplit([0.8, 0.2], seed=142)
train_df.cache()
test_df.cache()
pipeline_model = pipeline_object.fit(train_df)

It worked.