Finding Nginx/PHP-FPM bottleneck that is causing random 502 gateway errors

Solution 1:

Official recommendation: worker_processes = number of CPU cores

worker_processes 16;
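
To check how many cores the box actually has before setting this, a quick sketch (Linux; either command works):

# number of CPU cores as seen by the kernel
nproc
grep -c ^processor /proc/cpuinfo

On newer nginx (1.3.8+) you can also set worker_processes auto; and let nginx pick the value itself.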

Solution 2:

Sporadic 502 errors from load balancers such as HAProxy and Nginx are usually caused by something getting cut off mid-stream between the LB and the web server.

Try running one of your web servers, or a test copy of it, under GDB and see whether you get a segmentation fault when generating test traffic (use ab, JMeter, or a similar tool to simulate the traffic).
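
A hedged sketch of that workflow, assuming a PHP-FPM pool named www and a site answering on localhost (the PID placeholder, URL and concurrency are examples, adjust them for your setup):

# find a php-fpm worker process and attach GDB to it
pgrep -f 'php-fpm: pool www'
sudo gdb -p <worker_pid>
(gdb) continue

# in a second terminal, generate test traffic against that box
ab -n 5000 -c 25 http://localhost/

# if the worker crashes, GDB stops on SIGSEGV; capture the backtrace
(gdb) bt full

With several workers in the pool you may need a few runs before the crashing request lands on the worker you attached to.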

I had to solve a very similar problem recently. I had already ruled out resource exhaustion as the cause, since I had fairly comprehensive monitoring in place. In the end I found that the 502 errors were coming from the web server behind the load balancer returning invalid (in this case empty) HTTP responses to the LB.

I took one of the web servers, stopped the web server process, started it again under GDB, and then browsed the site. After some clicking around I eventually saw a segmentation fault, which produced a visible 502 error. I took the backtrace from GDB and submitted it to the PHP team as a bug, but the only fix for me was to switch distributions to work around the PHP bug.
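
If running the server under GDB is awkward, an alternative sketch is to let PHP-FPM write core dumps and take the backtrace from them afterwards; the paths and pool file names below are assumptions and vary by distribution:

# in the pool config (e.g. /etc/php-fpm.d/www.conf), allow workers to dump core:
#   rlimit_core = unlimited
# then reload php-fpm

# write cores somewhere predictable
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# after a crash, open the core file with your php-fpm binary (path varies) and get the trace
gdb /usr/sbin/php-fpm /tmp/core.php-fpm.<pid>
(gdb) bt full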

The segfault was causing the web server to send invalid content to the LB, and the LB was returning a 502 error because, as far as it was concerned, the web server had disappeared mid-flow.

I know this doesn't directly answer your question, but it's a place to start looking. Assuming you do see a segfault, you can take the stack trace from GDB and hopefully work backwards to find the function that is causing the segmentation fault.
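
Before reaching for GDB at all, the logs will often already tell you whether workers are dying; a sketch, assuming Debian/Ubuntu-style default log locations (adjust the paths for your distribution):

# php-fpm logs a warning when a worker exits on SIGSEGV
grep 'signal 11' /var/log/php*-fpm.log

# nginx records the moment the upstream vanished mid-request
grep 'upstream prematurely closed' /var/log/nginx/error.log

# the kernel also logs user-space segfaults
dmesg | grep -i segfault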