How to use AWS Batch?
I am trying to use the new AWS Batch feature: https://aws.amazon.com/batch/
I cannot get even the simplest Batch job to run (using the demo, which is "echo hello world"). The job stays stuck in the RUNNABLE state.
To try and isolate the issue, I am using all of the default settings on a brand new AWS account.
My understanding is that I should not have to launch any EC2 instances manually to use this feature, that AWS Batch should do this for me. It seems to be waiting for an available EC2 instance to run the job, though. Shouldn't it just start up an EC2 instance to run the job by itself?
Thanks in advance.
I noticed that when I specified a job definition with 8000 MiB of memory, the instance that was spun up had only 7986 MB, and my job got stuck in the RUNNABLE state.
8000 MiB is equal to 8388.608 MB, so it looks like the instance being spun up doesn't have enough memory available to run the job, and so it hangs.
If I create a job definition with 7000 MiB, then my job no longer gets stuck in the RUNNABLE state, as it still uses the same instance with 7986 MB of memory.
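The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names `mib_to_mb` and `fits` are made up for illustration, and it assumes the 7986 figure reported for the instance is in decimal MB.

```python
MIB_IN_BYTES = 1024 ** 2  # 1 MiB = 1,048,576 bytes
MB_IN_BYTES = 1000 ** 2   # 1 MB  = 1,000,000 bytes


def mib_to_mb(mib):
    """Convert mebibytes (MiB) to decimal megabytes (MB)."""
    return mib * MIB_IN_BYTES / MB_IN_BYTES


def fits(requested_mib, available_mb):
    """Return True if the requested memory fits in the available memory.

    Hypothetical helper: assumes available_mb is expressed in decimal MB.
    """
    return mib_to_mb(requested_mib) <= available_mb


print(mib_to_mb(8000))   # 8388.608
print(fits(8000, 7986))  # False: the job would stay in RUNNABLE
print(fits(7000, 7986))  # True: 7340.032 MB fits in 7986 MB
```

Even if the instance figure were actually in MiB, 8000 is still larger than 7986, so the conclusion would be the same.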
The Batch documentation has a troubleshooting section on "Jobs Stuck in RUNNABLE Status": https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#job_stuck_in_runnable
If your compute environment contains compute resources, but your jobs do not progress beyond the RUNNABLE status, then there is something preventing the jobs from actually being placed on a compute resource. Here are some common causes for this issue:
The awslogs log driver is not configured on your compute resources
AWS Batch jobs send their log information to CloudWatch Logs. To enable this, you must configure your compute resources to use the awslogs log driver. If you base your compute resource AMI off of the Amazon ECS-optimized AMI (or Amazon Linux), then this driver is registered by default with the ecs-init package. If you use a different base AMI, then you must ensure that the awslogs log driver is specified as an available log driver with the ECS_AVAILABLE_LOGGING_DRIVERS environment variable when the Amazon ECS container agent is started. For more information, see Compute Resource AMI Specification and Creating a Compute Resource AMI.
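If you want to check this on a custom AMI, something like the following Python sketch could be used to see whether awslogs is listed in the agent configuration. It assumes the variable is set in /etc/ecs/ecs.config as a JSON array on a KEY=VALUE line; the function names and the sample cluster name are made up for illustration.

```python
import json


def logging_drivers(ecs_config_text):
    """Return the drivers listed in ECS_AVAILABLE_LOGGING_DRIVERS.

    Returns an empty list if the variable is not set in the given text.
    That only means the file does not set it; the agent may still get
    the setting from elsewhere (for example, from ecs-init).
    """
    for line in ecs_config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "ECS_AVAILABLE_LOGGING_DRIVERS":
            return json.loads(value.strip())
    return []


def awslogs_enabled(ecs_config_text):
    """Return True if awslogs appears in the configured driver list."""
    return "awslogs" in logging_drivers(ecs_config_text)


# Sample file content; the cluster name is a hypothetical placeholder.
sample = (
    "ECS_CLUSTER=my-batch-cluster\n"
    'ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]\n'
)
print(awslogs_enabled(sample))  # True
```

On a real instance you would pass in the contents of /etc/ecs/ecs.config instead of the sample string.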
Insufficient resources
If your job definitions specify more CPU or memory resources than your compute resources can allocate, then your jobs will never be placed. For example, if your job specifies 4 GiB of memory, and your compute resources have less than that, then the job cannot be placed on those compute resources. In this case, you must reduce the specified memory in your job definition or add larger compute resources to your environment.
Amazon EC2 instance limit reached
The number of Amazon EC2 instances that your account can launch in an AWS region is determined by your EC2 instance limit. Certain instance types have a per-instance-type limit as well. For more information on your account's Amazon EC2 instance limits (including how to request a limit increase), see Amazon EC2 Service Limits in the Amazon EC2 User Guide for Linux Instances.
Other very common causes I see for this:
- No route to the internet
- CPU/memory in the job definition is higher than what the instances provide
- Instance is not registered with the ECS cluster
- Agent is disconnected - https://aws.amazon.com/premiumsupport/knowledge-center/ecs-agent-disconnected/
Additional troubleshooting steps you can take:
- Launch the associated ECS task definition manually in your cluster
- SSH into the container instance and try docker run there
- Curl the ECS and Batch endpoints from inside the container instance
- Remove the CPU/memory constraints on the job definition
- Review /etc/ecs/ecs.config
- Get ECS logs - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html
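
If curl is not handy on the instance, the endpoint check from the list above can be approximated in Python with the standard library only. This is a rough sketch: the function names are made up, the region is an assumption, and the host names follow the usual `<service>.<region>.amazonaws.com` pattern for standard commercial regions (other partitions differ). It only tests that a TCP connection on port 443 can be opened, not that the API call itself would succeed.

```python
import socket


def endpoint_hosts(region):
    """Build the regional ECS and Batch endpoint host names."""
    return [f"ecs.{region}.amazonaws.com", f"batch.{region}.amazonaws.com"]


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # "us-east-1" is an assumed region; use the one your environment runs in.
    for host in endpoint_hosts("us-east-1"):
        status = "reachable" if can_connect(host) else "UNREACHABLE"
        print(host, status)
```

If either host shows as unreachable from the instance, that points at the "no route to the internet" cause listed earlier.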