504s on Elastic Beanstalk app deploy (user -> ELB -> Elastic Beanstalk mod_wsgi)

I have a Python Elastic Beanstalk load-balanced app. Here is the path a user request takes on its way into the Elastic Beanstalk app:

user -> Elastic Beanstalk ELB -> Elastic Beanstalk mod_wsgi

The problem:

The first ~2-4 requests from the user after an eb deploy of a new app version generate 504 errors from the ELB.

After these ~2-4 requests that generate 504s, everything is fine! 200s all around.

When the 504s happen, zero requests make it through to the Elastic Beanstalk mod_wsgi app according to /var/log/httpd/access_log. I only see the 200s after the ELB has decided to start working again.

Things I have tried that didn't work:

  1. I increased the Elastic Beanstalk ELB Idle Timeout to 300 seconds
  2. I increased the Elastic Beanstalk mod_wsgi apache KeepAliveTimeout to 300 seconds as suggested here: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/ts-elb-error-message.html
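For reference, both of those settings can be applied from source control via `.ebextensions` instead of the console. This is a sketch, assuming a classic ELB and the Apache/mod_wsgi Python platform; the file name and the location of the extra Apache conf file are assumptions on my part:

```yaml
# .ebextensions/timeouts.config  (file name is arbitrary)
option_settings:
  # Classic ELB idle timeout, in seconds
  aws:elb:policies:
    ConnectionSettingIdleTimeout: 300

files:
  # Drop an extra Apache config file to raise KeepAliveTimeout;
  # assumes the platform's httpd picks up /etc/httpd/conf.d/*.conf
  "/etc/httpd/conf.d/keepalive.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      KeepAlive On
      KeepAliveTimeout 300
```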

One might say, "just live with the 504s!"

However, the actual problem is that in my production setup, I have CloudFlare between the user and the Elastic Beanstalk ELB. CloudFlare is set to aggressively cache .css and .js files, since I append md5 hashes to static file URLs. When requests for these important files fail with 504, CloudFlare appears to cache these failures as 404s. Subsequent requests for these files then 404, breaking the visual styling of the site on every deploy.

Deploying the Elastic Beanstalk app again with the same app version will fix the CloudFlare 404 problem. This is not a great solution. I want to keep on using CloudFlare because it makes for an excellent transparent CDN, so getting rid of it is not a solution, either.
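One partial workaround for the CloudFlare side of this (it does nothing about the 504s themselves) would be to purge the CloudFlare cache as a post-deploy step, so any cached failure responses are evicted. A sketch against CloudFlare's v4 purge endpoint; ZONE_ID and CF_API_TOKEN are placeholders for your own values:

```shell
# Purge the whole zone after a deploy; substitute your own
# zone ID and an API token with cache-purge permission.
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"purge_everything":true}'
```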

It's hard to believe I'm alone with this issue, but Google, stackoverflow/serverfault, and the AWS forums have not yielded any solutions—or even similar problem reports. I am hoping that my description of this behavior rings a bell with someone here. Thanks in advance.


Solution 1:

I had exactly the same problem which I really think is a bug with the Beanstalk deployer.

I was using a "Rolling" deployment policy with 2 instances and a batch size of 1, which should give zero downtime in theory. In reality, however, during a deployment there is still a window of about 10-15 seconds where the ELB responds with 504s.

Take a look at the "Updates and Deployments" settings in your Elastic Beanstalk configuration. I found that changing to "Rolling with additional batch" with a batch size of 100% works well and gives zero downtime during an update.
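If you would rather keep that deployment policy in source control than set it in the console, the same thing can be expressed in `.ebextensions` (the file name is arbitrary):

```yaml
# .ebextensions/deploy.config  (file name is arbitrary)
option_settings:
  aws:elasticbeanstalk:command:
    DeploymentPolicy: RollingWithAdditionalBatch
    BatchSizeType: Percentage
    BatchSize: 100
```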

Update, October 2018: I don't know how long it has been the case, but Elastic Beanstalk rolling updates now work properly again with zero downtime for me.