WordPress admin pages time out or take 30s to load on Elastic Beanstalk (PHP 5.6, Apache, RDS, CloudFlare)
It looks like you have two separate issues:
1) Connections take 30s to complete.
2) The db.r3.large RDS Aurora instance exceeds its 1000-connection limit, after which new connections time out because PHP can no longer establish new sessions to RDS.
The first looks a lot like a DNS name resolution issue. Check how your database connections are configured (IP address versus FQDN). If it's an FQDN, check your /etc/nsswitch.conf and your CloudFlare configuration. You want to make sure that forward and reverse name resolution work as they should and that the 30-second delay is not caused by them. You can also run tcpdump port 53 to see what is going on with name resolution.
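To check whether name resolution alone accounts for the delay, you could time the lookups directly. This is a rough sketch using only the Python standard library; the function names are made up for illustration, and port 3306 assumes a MySQL-compatible endpoint.

```python
import socket
import time


def time_lookup(hostname, port=3306):
    """Time a forward lookup and return (elapsed_seconds, sorted addresses)."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    elapsed = time.perf_counter() - start
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


def time_reverse_lookup(address):
    """Time a reverse lookup and return (elapsed_seconds, name or None)."""
    start = time.perf_counter()
    try:
        name = socket.gethostbyaddr(address)[0]
    except OSError:
        # No PTR record or resolver failure; the elapsed time is still informative.
        name = None
    elapsed = time.perf_counter() - start
    return elapsed, name


# Example use (replace the placeholder with the DB_HOST value from wp-config.php):
#   elapsed, addresses = time_lookup("your-db-endpoint.example.com")
#   for address in addresses:
#       print(address, time_reverse_lookup(address))
```

If either call takes several seconds on the web server but is fast elsewhere, the resolver configuration on that instance is the likely culprit.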
For the second, you need to figure out why the number of connections exceeds 1000.
When you are not using RDS Aurora, what is your "normal" number of connections? The query to check this depends on which database is used. If the site normally consumes more than 1000 connections, you would have to resize your RDS instance accordingly (or re-engineer your app; perhaps a WordPress plugin is driving the number of connections that high).
If the number of connections on a non-RDS database is significantly lower than 1000, you would have to troubleshoot what is causing the extra connections.
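As one way to see where the connections come from, assuming a MySQL-compatible database, you could fetch the output of `SHOW FULL PROCESSLIST` with whatever client you use and summarize it. The sketch below only does the summarizing; `summarize_connections` is a hypothetical helper, and the rows are assumed to be dicts keyed by the usual processlist column names (`Host`, `Command`).

```python
from collections import Counter


def summarize_connections(rows, limit=1000):
    """Summarize processlist rows by command state and by client address.

    rows:  iterable of dicts with at least "Host" and "Command" keys,
           as a dict-style cursor would return for SHOW FULL PROCESSLIST.
    limit: connection ceiling to compare against (1000 is the figure
           quoted in this thread, not something read from the server).
    """
    rows = list(rows)
    by_command = Counter(row["Command"] for row in rows)
    # Host is reported as "address:port"; drop the port to group by client.
    by_client = Counter(row["Host"].rsplit(":", 1)[0] for row in rows)
    return {
        "total": len(rows),
        "headroom": limit - len(rows),
        "by_command": by_command,
        "by_client": by_client,
    }
```

A large share of `Sleep` connections from one client would point to connections being opened and never closed, whereas many `Query` rows suggest the database is busy with actual work.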
A few links to start:
- https://aws.amazon.com/blogs/database/analyzing-amazon-rds-database-workload-with-performance-insights/
- https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/php-hawordpress-tutorial.html
OK, I found the answer. I was introduced to the tool mytop, which is basically top for MySQL queries. Thanks to that tool, I was able to see that a single query was running thousands and thousands of times and wound up choking everything.
After I identified the query, I jumped onto New Relic and, using its database stack trace, was able to find which PHP file was running the code that made the request. There I discovered a while loop that was out of control. I am unsure why that loop wasn't a problem on the old server, but I commented that code out and now AWS runs like a dream.
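For anyone without mytop at hand, the same kind of "one query repeated thousands of times" pattern can be spotted by grouping the `Info` column of `SHOW FULL PROCESSLIST` after stripping literal values. This is only an approximation of what mytop shows, not how mytop works internally; `normalize` and `top_queries` are hypothetical names, and the literal stripping is deliberately crude.

```python
import re
from collections import Counter

# Single-quoted string literals, allowing backslash escapes inside.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
# Standalone integers (digits inside identifiers such as wp_2_posts are left alone).
_NUMBER = re.compile(r"\b\d+\b")


def normalize(sql):
    """Replace literals with '?' and collapse whitespace so similar queries match."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split())


def top_queries(rows, n=5):
    """Return the n most frequent normalized queries as (query, count) pairs.

    rows: iterable of dicts with an "Info" key; idle connections report
          Info as NULL (None) and are skipped.
    """
    counts = Counter(normalize(row["Info"]) for row in rows if row.get("Info"))
    return counts.most_common(n)
```

Sampling the processlist a few times and feeding the rows to `top_queries` should make a runaway query stand out at the top of the list.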