AWS ECS Fargate ResourceInitializationError: unable to pull secrets or registry auth

I am trying to run an image from a private repository on the aws-ecs-fargate-1.4.0 platform.

For private repository authentication, I have followed the docs and it was working well.

Somehow, after updating the existing service many times, the task fails to run and reports an error like

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to get registry auth from asm: service call has been retried 1 time(s): asm fetching secret from the service for <secretname>: RequestError: ...

I haven't changed the ecsTaskExecutionRole, and it contains all the policies required to fetch the secret value.

  1. AmazonECSTaskExecutionRolePolicy
  2. CloudWatchFullAccess
  3. AmazonECSTaskExecutionRolePolicy
  4. GetSecretValue
  5. GetSSMParamters

Solution 1:

AWS employee here.

What you are seeing is due to a change in how networking works between Fargate platform version 1.3.0 and Fargate platform version 1.4.0. As part of the change from using Docker to using containerd, we also made some changes to how networking works. In version 1.3.0 and below, each Fargate task got two network interfaces:

  • One network interface was used for the application traffic from your application container(s), as well as for logs and container image layer pulls.
  • A secondary network interface was used by the Fargate platform itself, to get ECR authentication credentials, and fetch secrets.

This secondary network interface had some downsides though. This secondary traffic did not show up in your VPC flow logs. Also while most traffic stayed in the customer VPC, the secondary network interface was sending traffic outside of your VPC. A number of customers complained that they did not have the ability to specify network level controls on this secondary network interface and what it was able to connect to.

To make the networking model less confusing and give customers more control, in Fargate platform version 1.4.0 we switched to a single network interface and now keep all traffic inside your VPC, even the Fargate platform traffic. The Fargate platform traffic for fetching ECR authentication and task secrets now uses the same task network interface as the rest of your task traffic. You can observe this traffic in VPC flow logs and control it using the routing table in your own AWS VPC.

However, with this increased ability to observe and control the Fargate platform networking, you also become responsible for ensuring that there is actually a network path configured in your VPC that allows the task to communicate with ECR and AWS Secrets Manager.

There are a few ways to solve this:

  • Launch tasks into a public subnet, with a public IP address, so that they can communicate with ECR and other backing services using an internet gateway.
  • Launch tasks in a private subnet that has a VPC routing table configured to route outbound traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection to ECR on behalf of the task.
  • Launch tasks in a private subnet and make sure you have AWS PrivateLink endpoints configured in your VPC, for the services you need (ECR for image pull authentication, S3 for image layers, and AWS Secrets Manager for secrets).
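The three options above can be sketched as a small decision helper. This is only an illustration, not an AWS API: `SubnetConfig`, its fields, and `outbound_path` are hypothetical names, and the endpoint set assumes the task needs ECR (split into its `ecr.api` and `ecr.dkr` endpoint services), S3, and Secrets Manager.

```python
from dataclasses import dataclass, field


# Hypothetical model of a task's outbound options; not an AWS API.
@dataclass
class SubnetConfig:
    has_internet_gateway_route: bool = False  # route table sends 0.0.0.0/0 to an internet gateway
    assigns_public_ip: bool = False           # the task ENI gets a public IP address
    has_nat_gateway_route: bool = False       # route table sends 0.0.0.0/0 to a NAT gateway
    vpc_endpoints: set = field(default_factory=set)  # e.g. {"ecr.api", "ecr.dkr"}


# Assumed endpoint set for the third option: ECR auth, S3 for layers, Secrets Manager.
REQUIRED_ENDPOINTS = {"ecr.api", "ecr.dkr", "s3", "secretsmanager"}


def outbound_path(cfg: SubnetConfig) -> str:
    """Return which of the three options gives the task a path to ECR and secrets."""
    # Option 1: public subnet plus public IP, traffic leaves via the internet gateway.
    if cfg.has_internet_gateway_route and cfg.assigns_public_ip:
        return "internet gateway"
    # Option 2: private subnet that routes outbound traffic through a NAT gateway.
    if cfg.has_nat_gateway_route:
        return "NAT gateway"
    # Option 3: private subnet with every required VPC endpoint present.
    missing = REQUIRED_ENDPOINTS - cfg.vpc_endpoints
    if not missing:
        return "VPC endpoints"
    return "no path; missing endpoints: " + ", ".join(sorted(missing))
```

A task in a public subnet without a public IP address falls through to the last branch, which matches the case where the platform cannot reach ECR or Secrets Manager and the task stops with ResourceInitializationError.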

You can read more about this change in this official blog post, under the section "Task elastic network interface (ENI) now runs additional traffic flows":

https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/

Solution 2:

I'm not completely sure about your setup, but after I disabled the NAT gateways to save some $, I got a very similar error message on the aws-ecs-fargate-1.4.0 platform:

Stopped reason: ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....

It turned out that I had to create VPC endpoints for these service names:

  • com.amazonaws.REGION.s3
  • com.amazonaws.REGION.ecr.dkr
  • com.amazonaws.REGION.ecr.api
  • com.amazonaws.REGION.logs
  • com.amazonaws.REGION.ssm

And I had to downgrade to the aws-ecs-fargate-1.3.0 platform. After the downgrade the Docker images could be pulled from ECR and the deployments succeeded again.

If you are using Secrets Manager without a NAT gateway, you may also have to create a VPC endpoint for com.amazonaws.REGION.secretsmanager.
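To avoid typos when filling in REGION, the service names from this answer can be generated with a few lines of Python. This is just a sketch: `endpoint_service_names` and its `use_secrets_manager` flag are hypothetical helpers, and the region passed in is whatever region your tasks run in.

```python
# Service suffixes taken from the list in this answer.
BASE_SERVICES = ["s3", "ecr.dkr", "ecr.api", "logs", "ssm"]


def endpoint_service_names(region: str, use_secrets_manager: bool = False) -> list:
    """Build the VPC endpoint service names for one region."""
    services = list(BASE_SERVICES)
    # Only needed when the task pulls secrets or registry credentials from Secrets Manager.
    if use_secrets_manager:
        services.append("secretsmanager")
    return [f"com.amazonaws.{region}.{service}" for service in services]
```

For example, `endpoint_service_names("eu-west-1", use_secrets_manager=True)` returns the five names above plus `com.amazonaws.eu-west-1.secretsmanager`. The resulting names are what you pick as the service when creating each VPC endpoint.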