nginx - Load Balancer - Considerable lag when upstream node is offline/down

Running nginx 1.0.15 on CentOS 6.5. I have three upstream servers and everything works fine; however, when I simulate an outage and take one of the upstream servers down, I notice considerable lag in response times (an additional 5-7 seconds). The second I bring the downed server back online, the lag disappears. Another odd thing I noticed: if I simply stop the httpd service on the simulated-outage server, response times stay normal; the lag only occurs when the server is completely down.

Here is my conf:

upstream prod_example_com {

    server app-a-1:51000;

    server app-a-2:51000;

    server app-a-3:51000;

}


server {

    # link:  http://wiki.nginx.org/MailCoreModule#server_name
    server_name example.com www.example.com *.example.com;

    #-----
    # Upstream logic
    #-----


    set $upstream_type prod_example_com;


    #-----

    include include.d/common.conf;

    # Configure logging
    access_log  /var/log/nginx/example/access/access.log access;
    error_log   /var/log/nginx/example/error.log error;

    location / {

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_pass
        proxy_pass  http://$upstream_type$request_uri;

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_set_header
        proxy_set_header    Host    $host;
        proxy_set_header    X-Real-IP   $remote_addr;
        proxy_set_header    X-Forwarded-For     $proxy_add_x_forwarded_for;
    }

    location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_pass
        proxy_pass  http://$upstream_type$request_uri;

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_set_header
        proxy_set_header    Host    $host;
        proxy_set_header    X-Real-IP   $remote_addr;
        proxy_set_header    X-Forwarded-For     $proxy_add_x_forwarded_for;

        proxy_hide_header expires;
        proxy_hide_header Cache-Control;

        # Even though this reads like the older syntax, it is handled internally by nginx to set max-age to now + 1 year
        expires max;

        # Allow intermediary caches the ability to cache the asset
        add_header Cache-Control "public";
    }
}

I have tried the suggestions from similar posts like this, and apparently my version of nginx is too old to support the health_check directive as outlined in the nginx docs. I've also tried explicitly setting max_fails=2 and fail_timeout=120 on the app-a-3 upstream definition, but none of this avoids the additional 5-7 second lag for every request while app-a-3 is offline.
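For clarity, here is roughly what the upstream block looks like with those parameters in place (a sketch; the 2 / 120 values are simply the ones I tested, not a recommendation):

upstream prod_example_com {

    server app-a-1:51000;

    server app-a-2:51000;

    # Passive checks only: after max_fails failed attempts within
    # fail_timeout seconds, nginx treats the peer as down for fail_timeout
    server app-a-3:51000 max_fails=2 fail_timeout=120;

}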

-- Update --

Per request, here is the output for a single request when app-a-3 is completely down. The only thing I could see out of the ordinary is the 3-second lag between the initial event and the subsequent event.

-- Update #2 --

Looks like a few years ago Nginx decided to create Nginx Plus, which adds active health checks, but only with a yearly support contract. Based on some articles I've read, Nginx got sick of making companies millions and getting nothing in return.

As mentioned in the comments, we are bootstrapping and don't have the $$ to throw at a $1,350 contract. I did find this repo, which provides the functionality. Does anyone have experience with it? Is it stable? Performant?

Worst case, I will just have to bite the bullet and pay the extra $20 / month for a Linode "NodeBalancer", which I am pretty sure is based on Nginx Plus. The only problem is that there is no control over the config other than a few generic options, so there is no way to support multiple vhost files via one balancer, and all the nodes have to be in the same datacenter.

-- Update #3 --

Here are some siege results. It seems the second node is misconfigured, as it is only able to handle about 75% of the requests the first and third nodes handle. I also thought it odd that when I took the second node offline, performance was as bad as when I took the third (better-performing) node offline. Logic would dictate that if I removed the weak link (the second node), I would get better performance, because the two remaining nodes each perform better than the weak link individually.

In short:

node 1, 2, 3 + my nginx = 2037 requests

node 1, 2 + my nginx  = 733 requests

node 1, 3 + my nginx = 639 requests (huh? these two perform better individually, so together they should land somewhere around ~1500 requests, given the ~2000 requests when all three nodes are up)

node 1, 3 + Linode Load Balancer = 790 requests

node 1, 2, 3 + Linode Load Balancer = 1,988 requests

If nginx sends a request to a closed port on a server with a functional IP stack, it'll get an immediate negative acknowledgement (a TCP RST, i.e. "connection refused"). If there's no server there to respond (or if you drop the incoming packet at a firewall), then you'll have to wait for the connection attempt to time out. That's why stopping httpd alone doesn't add any lag, while taking the host down completely does.

Most load balancers have a polling mechanism and/or heartbeat for preemptively checking for a down server. You might want to look into those options. Polling isn't usually run against a web server more than once or twice a minute, but a heartbeat check for server-down situations might run every second or so.
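Purely as an illustration (nginx syntax, to match the rest of the thread; the port and path are made up), the target of such a poll or heartbeat is usually nothing more than a cheap endpoint on each backend that answers without touching the application:

server {
    listen 51001;

    # Lightweight target for an external poller/heartbeat; the checker
    # flags the node as down when this stops answering.
    location = /healthz {
        access_log off;
        return 204;
    }
}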

Nginx is not the most sophisticated of load balancers. If you're getting into this sort of issue, you might want to look at other options.

EDIT: Something like this maybe? http://www.howtoforge.com/setting-up-a-high-availability-load-balancer-with-haproxy-heartbeat-on-debian-lenny . For a smallish installation there's no need for separate servers; just put it on the web server boxes. That gives load balancing, but not caching. There are also HA solutions that control squid or varnish in response to a heartbeat.


A couple of things you can try:

  1. Update to the latest version of nginx from the official repos http://nginx.org/en/linux_packages.html#stable
  2. Try reducing the proxy_connect_timeout setting; set it to something really low for testing, say 1 second (see the sketch after this list). http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_connect_timeout
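If it helps, here is a minimal sketch of what suggestion 2 could look like inside your existing location block (1 second is a test value only; proxy_next_upstream error timeout is already nginx's default behavior, spelled out here just to make the failover explicit):

    location / {

        # Give up on an unreachable backend after 1 second instead of
        # the default 60s, then let nginx retry on the next upstream peer.
        proxy_connect_timeout   1s;
        proxy_next_upstream     error timeout;

        proxy_pass  http://$upstream_type$request_uri;
    }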

Over the last few weeks I have been working with the NGINX pre-sales engineering team to try to resolve the issue before purchasing the support contract. After a lot of tinkering and collaboration, the only explanation we could come up with for the increased lag when a single node goes completely dark is that the servers in question were all running the much older Apache 2.2.

The NGINX engineers were not able to recreate the issue using Apache 2.4.x, so that would be my suggested fix if anyone else encounters the same situation. However, for our project, I am working on shifting away from Apache altogether and implementing NGINX with php-fpm.
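For anyone following along, the per-node setup will look roughly like this (a sketch, assuming php-fpm listens on 127.0.0.1:9000 and the docroot is /var/www/example; both are placeholders):

server {

    listen 51000;
    root  /var/www/example;
    index index.php;

    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        # Hand PHP requests to the local php-fpm pool
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_index index.php;
        fastcgi_pass 127.0.0.1:9000;
    }
}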

In closing, our environment will use NGINX Plus (which requires the support contract) as the load balancer, due to its ability to issue health checks to upstream nodes, distributing requests via round robin to upstream nodes running NGINX (FOSS).
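For completeness, the active health checks on the NGINX Plus side will look something like this (sketched from the NGINX Plus docs; the interval/fails/passes values are placeholders, and health_check requires the upstream group to live in a shared memory zone):

upstream prod_example_com {

    zone prod_example_com 64k;

    server app-a-1:51000;
    server app-a-2:51000;
    server app-a-3:51000;
}

server {

    listen 80;

    location / {
        proxy_pass http://prod_example_com;

        # Probe each node every 5s; mark it down after 3 failed probes
        # and bring it back after 2 successful ones.
        health_check interval=5 fails=3 passes=2;
    }
}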