Browser-based DNS failover using multiple A records

It has recently come to my attention that setting up multiple A records for a hostname can be used not only for round-robin load-balancing but also for automatic failover.
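
For illustration, this is the kind of record set I mean: one hostname answering with several addresses. A quick way to see it (the hostname and IPs below are placeholders, not our real ones):

    # List all A records returned for the name; with round-robin DNS you
    # should see every server's address (example output).
    $ dig +short www.example.com A
    192.0.2.10
    192.0.2.11
    192.0.2.12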

So I tried testing it:

  1. I loaded a page from our domain
  2. Noted which of our servers had served the page
  3. Turned off the web server on that host
  4. Reloaded the page

And indeed the browser automatically tried a different server to load the page. This worked in Opera, Safari, IE, and Firefox. Only Chrome failed to try a different server.
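
For anyone who wants to reproduce steps 1-2 from the command line, here is a minimal sketch (the hostname is a placeholder, and it assumes you can tell your servers apart by a response header or some marker in the page):

    # Show which IP curl actually connected to, plus the response headers.
    $ curl -sv -o /dev/null http://www.example.com/ 2>&1 | grep -E 'Connected to|^< '

    # Or pin the request to one specific backend to compare responses.
    $ curl --resolve www.example.com:80:192.0.2.10 http://www.example.com/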

But after leaving that server offline for a few minutes and looking at the access logs, I found that the number of requests to the other servers had not increased significantly. With 1 out of 3 servers offline, I expected accesses to each of the remaining 2 servers to increase by roughly 50% (each should absorb half of the offline server's third of the traffic: 1/6 on top of its own 1/3). Instead I saw only a 7-10% increase. That can only mean that in-browser DNS failover does not work for the majority of browsers/visitors, which directly contradicts what I had just tested.

Does anyone have an idea what is going on with browsers' DNS failover behavior? What possible reason could there be for automatic failover to work for me but not for the majority of our visitors?

edit: To be clear, I made absolutely no change to our DNS settings, so there is no TTL or propagation issue here; it's entirely about how the client handles the multiple A records.


OK, I am going to start by saying that DNS is not a good failover system in any way; for that you need a reverse proxy or load balancer. There are several reasons why the experience is not the same for everyone. First of all, Chrome uses the OS to grab DNS info, so which IPs it sees depends on the OS resolver, and the OS in this case might hand it only one IP.
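
You can check what the system resolver actually hands to applications, as opposed to what is published in DNS, with something like this on Linux (hostname is a placeholder):

    # getent goes through the OS resolver (nsswitch/glibc) — the same path
    # this answer says Chrome uses — so compare its output with `dig +short`.
    $ getent ahosts www.example.com
    192.0.2.10      STREAM www.example.com
    192.0.2.11      STREAM
    192.0.2.12      STREAM
    # (output trimmed; getent also prints DGRAM/RAW lines per address)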

As for the other browsers, how well this works depends heavily on how each one does DNS. The browser itself might decide not to try the other IPs, or might even retry the same one several times, depending on the response from the DNS server.

This brings us to the DNS resolvers themselves: many do not respect your TTL and keep records cached for however long they like, meaning users could keep getting your old IP for quite a while...
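
You can spot-check this by asking a recursive resolver directly and watching the TTL it reports (resolver address and hostname are examples):

    # The second column is the TTL remaining in the resolver's cache.
    # Re-run it: the number should count down and never exceed your zone's
    # TTL — if it does, that resolver is ignoring your settings.
    $ dig @8.8.8.8 www.example.com A +noall +answer
    www.example.com.  287  IN  A  192.0.2.10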

Fourth, there's user experience: do you want users to have to refresh 3 or 4 times to reach your website? Do you have any session- or login-based features on your site? What happens if the browser gets another IP in the middle of a session? If you really need HA and uptime, honestly, you need to consider doing it right, or it will end up more broken than just using one server.


To me it's a great deal if you don't want to pay for expensive load balancers. See my answer here about how browsers handle it: https://serverfault.com/a/868535/114520

Now, about your concern: how did you monitor accesses? Was it the size of some access_log? Was it the requests per second on your webserver?
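
For example, a crude requests-per-minute count straight out of a combined-format access log would look something like this (the log path is an assumption):

    # Bucket requests by minute; in a combined log, field 4 looks like
    # [10/Oct/2023:14:01:22, so the 17 chars after '[' give the minute.
    $ awk '{print substr($4, 2, 17)}' /var/log/nginx/access.log | uniq -c | tail -5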

Maybe you have some caching layer on the webserver, which won't hit your dynamic backend (PHP, Java...) if the response is already in cache. The more servers you have, the more requests it takes before everything is cached (if they don't share a cache).

Before assuming it's a DNS issue, add real monitoring: for example, a live analytics tracker or something similar. Then shut down one server and see whether the live tracker shows a decrease in current users on the website.

For many years I've used, and still use, this setup with real pleasure. I've only added a few more failover layers:

  • Round-Robin on 2 or 3 nodes
  • each node has:
    • Varnish with director/probes to all backends
    • lighttpd (Apache or nginx will do!) on another port with fastcgi
    • PHP-FPM pool

If one PHP-FPM pool goes down, the Varnish probe fails and removes that backend until the probe succeeds again. If Varnish itself fails, then Round-Robin DNS plus the browser handles the switch to another node.
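
For the curious, the probe part of that setup looks roughly like this in VCL (Varnish 4 syntax; hosts, ports and the health-check URL are placeholders):

    vcl 4.0;
    import directors;

    # Mark a backend sick when 3 of the last 5 probes fail.
    probe ping {
        .url       = "/ping";
        .interval  = 5s;
        .timeout   = 2s;
        .window    = 5;
        .threshold = 3;
    }

    backend app1 { .host = "192.0.2.10"; .port = "8081"; .probe = ping; }
    backend app2 { .host = "192.0.2.11"; .port = "8081"; .probe = ping; }

    sub vcl_init {
        # The round-robin director skips any backend whose probe is failing.
        new cluster = directors.round_robin();
        cluster.add_backend(app1);
        cluster.add_backend(app2);
    }

    sub vcl_recv {
        set req.backend_hint = cluster.backend();
    }

You can watch the probes flip backends between healthy and sick with varnishadm backend.list.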