Some nginx reverse proxy configs stop working once a day

I have an nginx reverse proxy which forwards requests from an outer Amazon ELB to internal ELBs.

I have 6 backend instances that handle the requests. The sites-enabled configs all look like this, differing only in port number and proxy_pass target; everything else is identical:

server {
    listen 3000;
    location / {
        proxy_pass http://internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000;
        include /etc/nginx/proxy.conf;
    }
}

Roughly once every 24 hours, one of the configurations stops working; all the other proxies keep working fine. If I restart nginx, all configurations work again. There is nothing in error.log, and nothing weird in the access log, syslog, or dmesg.

Is this a known issue? Have I done something wrong in my proxy configs? Are there any other logs I can look at?


The answer to this question is that ELBs sometimes change IP addresses, and nginx only resolves hostnames when it starts.

To fix this, use the DNS server that every VPC provides at the .2 address of its CIDR block. So if the VPC CIDR is 10.0.0.0/16, the DNS server is at 10.0.0.2.

Add this to the nginx config:

resolver 10.0.0.2 valid=10s;

The proxy_pass target also needs to be defined as a variable, otherwise nginx will still only resolve it once. So based on the configuration above, this is the correct config:

server {
    listen 3000;
    location / {
        resolver 10.0.0.2 valid=10s;
        set $backend "http://internal-prod732r8-PrivateE-1GJ070M0745TT-348518554.eu-west-1.elb.amazonaws.com:3000";
        proxy_pass $backend;
        include /etc/nginx/proxy.conf;
    }
}
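
As a side note, the resolver directive does not have to be repeated in every location; it can be declared once at the http level so every server block inherits it. A minimal sketch, assuming the same 10.0.0.0/16 VPC and a standard sites-enabled include (adjust paths to your layout):

http {
    # Assumption: 10.0.0.2 is the VPC-provided DNS server, as above
    resolver 10.0.0.2 valid=10s;
    include /etc/nginx/sites-enabled/*;
}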

If your proxy_pass did not point directly at a single URL, as your example's amazonaws.com hostname does, but instead at an upstream farm, like this:

upstream my_upstream {
    server 127.0.0.1:1337;
    server 127.0.0.1:1338;
}
location / {
    proxy_pass http://my_upstream;
}

Then you would be less concerned about one of the upstreams temporarily failing, because they are all doing the same job: if one fails to reply, the next one handles that response. Peace of mind.

Nginx will skip a failed machine for x seconds automatically, until you repair it or it comes back by itself. (http://wiki.nginx.org/HttpUpstreamModule)
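
That "x seconds" and the failure threshold are tunable per backend with the max_fails and fail_timeout parameters. A minimal sketch, reusing the hypothetical local backends from the upstream example above:

upstream my_upstream {
    # Mark a backend as unavailable for 30s after 3 failed attempts
    server 127.0.0.1:1337 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:1338 max_fails=3 fail_timeout=30s;
}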

So whatever the reason for your interruptions may be, distributing requests across an upstream farm turns this into an easier setup to manage.