DNS Failover with multiple Nginx load balancers

Solution 1:

I love this approach; it's my favorite, and I will buy you a beer if you are ever in San Francisco!

Two answers. First, for your 502 issue: add this to your nginx config so that, as long as at least some nodes are healthy, nginx will retry (by default, on a 502 it just gives up):

http://wiki.nginx.org/HttpProxyModule#proxy_next_upstream

proxy_next_upstream 

syntax: proxy_next_upstream [error|timeout|invalid_header|http_500|http_502|http_503|http_504|http_404|off];
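A minimal sketch of what that looks like in practice (upstream names, IPs, and ports here are placeholders, not from your setup):

```nginx
upstream app_backends {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backends;
        # If one backend errors, times out, or returns a 502,
        # try the next server in the upstream group instead of
        # returning the 502 to the client.
        proxy_next_upstream error timeout http_502;
    }
}
```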

Secondly, for your 'back to DNS' failover, you need to change the approach slightly. For these setups, what I've usually done is pull DNS all the way back to the app nodes themselves, which tests connectivity all the way through the load balancer to the end node. As a bonus, you can integrate DNS with your application and have it shut down the DNS server if the app is dead. The idea is to have the client's DNS request 'test' that the entire path works, not just connectivity to the LB.

Obviously you can't use nginx for this. I've used pf rules for it, and you can do the same thing in iptables: just round-robin requests to the backend nodes and run BIND on your backend servers. Then make sure you have multiple NS entries, one per 'LB'. The client takes care of trying each NS record; in my testing the average failover time was 2 seconds, and it worked for 99% of the operating systems we looked at. Let me know if that makes sense. It will work better than any scenario that tries to recover after the client has already made the first TCP request.
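To illustrate the 'multiple NS entries' part, a hedged sketch of the zone data (hostnames, IPs, and TTLs are made up for the example): each 'LB' host runs BIND, the zone lists one NS record per LB, and resolvers fail over between nameservers on their own when one stops answering.

```
; example.com zone fragment (hypothetical names/IPs)
example.com.       IN NS  lb1.example.com.   ; one NS record per 'LB'
example.com.       IN NS  lb2.example.com.
lb1.example.com.   IN A   192.0.2.10
lb2.example.com.   IN A   192.0.2.20
; short TTL on the service record so clients re-resolve quickly
www.example.com.   300 IN A 192.0.2.10
www.example.com.   300 IN A 192.0.2.20
```

If lb1 dies, its BIND stops answering, so the client's resolver retries the query against lb2; that retry is what produces the roughly 2-second failover mentioned above.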

With this solution I've built sites that maintain 100% availability according to Gomez and Keynote monitoring. As you already mentioned, it can cause a small initial performance penalty for the DNS lookup, but the site always works, and customers love that (as does my pager).