NGINX timeout after 200+ concurrent connections

Solution 1:

You will need to dump your network connections during the test. While the server may show near-zero load, your TCP/IP stack could be filling up. Look for TIME_WAIT connections in the netstat output.
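
A quick way to do that (assuming netstat from net-tools, or ss from iproute2, is available on the box) is to count connections per TCP state:

netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn
# or, with the newer tooling:
ss -s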

If that is the case, you will want to look into tuning the TCP/IP kernel parameters relating to TCP wait states, TCP recycling, and similar settings.

Also, you have not described what is being tested.

I always test:

  • static content (image or text file)
  • simple php page (phpinfo for example)
  • application page

This may not apply in your case, but it is something I do when performance testing. Testing different types of files can help you pinpoint the bottleneck.

Even with static content, testing files of different sizes is important as well to get timeouts and other metrics dialed in.
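
For example, with ApacheBench (just an illustration; any load generator will do, and the hostname and paths here are placeholders):

ab -n 20000 -c 250 -k http://yourserver/static/small.jpg
ab -n 20000 -c 250 http://yourserver/phpinfo.php
ab -n 20000 -c 250 http://yourserver/app/some-page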

We have some static-content Nginx boxes handling 3000+ active connections, so Nginx can certainly do it.

Update: Your netstat output shows a lot of open connections. You may want to try tuning your TCP/IP stack. Also, what file are you requesting? Nginx should be closing those connections quickly.

Here is a suggestion for sysctl.conf:

# Widen the ephemeral port range available for outgoing connections
net.ipv4.ip_local_port_range = 1024 65000
# TCP receive buffer sizes: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 8388608
# Drop FIN-WAIT-2 sockets and start keepalive probes sooner than the defaults
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
# Aggressively recycle/reuse TIME_WAIT sockets (but see the caveat in Solution 3)
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

These values are very low, but I have had success with them on high-concurrency Nginx boxes.
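
If you put those in /etc/sysctl.conf, remember they only take effect once you reload them:

sysctl -p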

Solution 2:

Yet another hypothesis. You have increased worker_rlimit_nofile, but the maximum number of clients is defined in the documentation as

max_clients = worker_processes * worker_connections

What if you try raising worker_connections to, say, 8192? Or, if there are enough CPU cores, increase worker_processes? A minimal sketch of the relevant directives is below.
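
For reference, here is how those directives fit together in nginx.conf (the numbers are only examples; adjust them to your hardware and file-descriptor limits):

worker_processes auto;          # or an explicit core count on older nginx versions
worker_rlimit_nofile 16384;     # should comfortably exceed worker_connections

events {
    worker_connections 8192;
}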

Solution 3:

I was having a very similar issue with an nginx box serving as a load balancer in front of an upstream of Apache servers.

In my case I was able to isolate the problem as networking-related: it appeared as the upstream Apache servers became overloaded. I could recreate it with simple bash scripts while the overall system was under load. According to an strace of one of the hung processes, the connect() call was getting ETIMEDOUT.
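
If you want to reproduce that kind of check, attaching strace to a hung process and filtering on connect() is enough (the PID placeholder is whatever process is stuck):

strace -tt -e trace=connect -p <PID>
# look for connect() calls returning -1 ETIMEDOUT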

These settings (on both the nginx box and the upstream servers) eliminated the problem for me. Before making these changes I was getting 1 or 2 timeouts per minute (on boxes handling ~100 reqs/s); now I get 0.

# Enable SYN cookies when the SYN backlog overflows
net.ipv4.tcp_syncookies = 1
# Time sockets spend in FIN-WAIT-2
net.ipv4.tcp_fin_timeout = 20
# Allow more half-open (SYN_RECV) connections to queue
net.ipv4.tcp_max_syn_backlog = 20480
# Packets allowed to queue on the input side before the kernel processes them
net.core.netdev_max_backlog = 4096
# Cap on simultaneous TIME_WAIT sockets
net.ipv4.tcp_max_tw_buckets = 400000
# Maximum accept() backlog per listening socket
net.core.somaxconn = 4096
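
One related note (an assumption about the setup, not something from the question): net.core.somaxconn only raises the ceiling on the accept queue; nginx still requests its own listen backlog (511 by default on Linux), so you may also want to bump it in the listen directive:

listen 80 backlog=4096;   # capped by net.core.somaxconn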

I would not recommend using net.ipv4.tcp_tw_recycle or net.ipv4.tcp_tw_reuse, but if you want to use one, go with the latter. They can cause bizarre issues if there is any kind of latency at all, and tcp_tw_reuse is at least the safer of the two.

I think having tcp_fin_timeout set to 1, as in your configuration, may be causing some trouble as well. Try putting it at 20 or 30 - still far below the default of 60.
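
To check what a box is actually running with, or to try a value without editing sysctl.conf, the sysctl tool works directly:

sysctl net.ipv4.tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30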