Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

Solution 1:

I had this exact problem when multiple users were trying to run jobs on our cluster at once. The fix was to change a setting of the YARN scheduler.

In the file /etc/hadoop/conf/capacity-scheduler.xml we changed the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5.
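For reference, the changed property in capacity-scheduler.xml looks like the fragment below (0.5 is the value we settled on; tune it to your cluster):

```xml
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
```

After editing, you may need to refresh the scheduler queues (yarn rmadmin -refreshQueues) or restart the ResourceManager for the change to take effect.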

Changing this setting increases the fraction of cluster resources made available to application masters, which raises the number of masters that can run at once and hence the number of concurrent applications.
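To see the effect concretely, a back-of-the-envelope sketch (the memory figures below are made up for illustration, not from the original cluster):

```shell
# Illustrative numbers, not from the original cluster:
cluster_mem_mb=102400   # total scheduler memory (100 GB)
am_mem_mb=2048          # memory per application master (2 GB)

# With the default maximum-am-resource-percent of 0.1:
echo $(( cluster_mem_mb / 10 / am_mem_mb ))   # 5 concurrent AMs

# After raising it to 0.5:
echo $(( cluster_mem_mb / 2 / am_mem_mb ))    # 25 concurrent AMs
```

With only 5 AM slots, a sixth submission sits in ACCEPTED until an earlier application finishes, which is exactly the symptom described in the question.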

Solution 2:

I got this error in this situation:

  1. MASTER=yarn (or yarn-client)
  2. spark-submit runs on a computer outside the cluster, and there is no route from the cluster back to it because it sits behind a router

Logs for container_1453825604297_0001_02_000001 (from ResourceManager web UI):

16/01/26 08:30:38 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
16/01/26 08:31:41 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:44 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:45 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484) 

I worked around it by using YARN cluster mode: MASTER=yarn-cluster.

On another computer that is configured similarly, but whose IP is reachable from the cluster, both yarn-client and yarn-cluster work.

Others may encounter this error for different reasons; my point is that checking the error logs (not visible in the terminal here, but in the ResourceManager web UI) almost always helps.
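In command form, a minimal sketch of the workaround (the jar name is illustrative; on Spark 1.x the mode is selected via the master URL):

```shell
# yarn-client: the AM inside the cluster must connect back to the driver
# on the submitting machine; this fails when that machine is behind NAT.
spark-submit --master yarn-client my-app.jar

# yarn-cluster: the driver runs inside the cluster, so no route back
# to the submitting machine is needed.
spark-submit --master yarn-cluster my-app.jar
```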

Solution 3:

There are three ways we can try to fix this issue.

  1. Check for stray Spark processes on your machine and kill them.

Do

ps aux | grep spark

Take the process IDs of the Spark processes and kill them, e.g.:

sudo kill -9 4567 7865
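An equivalent shorter form, assuming procps-style pgrep/pkill are available (you may want to tighten the match pattern so you don't hit unrelated processes):

```shell
# List the matching PIDs first to review what would be killed,
# then kill them all in one command.
pgrep -f spark
sudo pkill -9 -f spark
```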

  2. Check the number of Spark applications running on your cluster.

To check this, do

yarn application -list

you will get an output similar to this:

Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1496703976885_00567       ta da                SPARK        cloudera       default             RUNNING           UNDEFINED              20%             http://10.0.52.156:9090

Check the application IDs; if more than one or two are running, kill the extras. Your cluster may not be able to run more than two Spark applications at the same time. I am not 100% sure about this, but if you run more than two Spark applications on the cluster, it starts complaining. Kill them like this:

yarn application -kill application_1496703976885_00567
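If several applications are stuck, a loop like the following kills them all (a sketch; the column positions assume the output format shown above):

```shell
# Kill every listed SPARK application (SUBMITTED/ACCEPTED/RUNNING).
for app in $(yarn application -list 2>/dev/null | grep -w SPARK | awk '{print $1}'); do
  yarn application -kill "$app"
done
```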

  3. Check your Spark config parameters. For example, requesting too much executor memory or driver memory, or too many executors, for your Spark application may also cause this issue. Reduce any of them and rerun your Spark application; that might resolve it.
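As a sketch with made-up values (the flags themselves are standard spark-submit options), a reduced-resource submission looks like:

```shell
# Requests larger than your queue can satisfy leave the application
# stuck in ACCEPTED, so start modest and grow only as needed.
spark-submit \
  --master yarn-cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  my-app.jar
```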