Why is DNS failover not recommended?

By 'DNS failover' I take it you mean DNS Round Robin combined with some monitoring, i.e. publishing multiple IP addresses for a DNS hostname, and removing a dead address when monitoring detects that a server is down. This can be workable for small, less trafficked websites.
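&nbsp;
As a rough sketch of that monitoring loop in Python (the addresses, the /healthz path and the publish_a_records() helper are hypothetical stand-ins for your provider's API, not any particular product):

    import urllib.request

    POOL = ["192.0.2.10", "192.0.2.20"]   # example addresses (documentation range)

    def health_check(ip):
        """True if the web server at this address answers within 3 seconds."""
        try:
            urllib.request.urlopen(f"http://{ip}/healthz", timeout=3)
            return True
        except OSError:
            return False

    def publish_a_records(addresses):
        """Push the surviving A records to the authoritative DNS (provider-specific)."""
        print("publishing A records:", addresses)

    publish_a_records([ip for ip in POOL if health_check(ip)])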

By design, when you answer a DNS request you also provide a Time To Live (TTL) for the response you hand out. In other words, you're telling other DNS servers and caches "you may store this answer and use it for x minutes before checking back with me". The drawbacks come from this:

  • With DNS failover, an unknown percentage of your users will have your DNS data cached with varying amounts of TTL left. Until the TTL expires, they may connect to the dead server. There are faster ways of completing a failover than this.
  • Because of the above, you're inclined to set the TTL quite low, say 5-10 minutes. But a higher TTL gives a (very small) performance benefit, and may help your DNS resolution keep working reliably through a short glitch in network traffic. So DNS-based failover pushes you toward low TTLs, while high TTLs are part of how DNS is meant to work and can be useful (a quick query sketch follows this list).
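&nbsp;
To make the TTL mechanics concrete, here is one way to see the TTL an answer carries, sketched in Python and assuming the dnspython package (the hostname is just an example):

    import dns.resolver  # pip install dnspython

    # Ask for the A records of a name and show the TTL handed out with the answer;
    # resolvers and caches may keep serving these addresses until it expires.
    answer = dns.resolver.resolve("www.example.com", "A")
    print("TTL:", answer.rrset.ttl, "seconds")
    for record in answer:
        print("A record:", record.address)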

The more common methods of getting good uptime involve:

  • Place servers together on the same LAN.
  • Place that LAN in a datacenter with highly available power and network connectivity.
  • Use an HTTP load balancer to spread load and fail over on individual server failures (a toy sketch follows this list).
  • Get the level of redundancy / expected uptime you require for your firewalls, load balancers and switches.
  • Have a communication strategy in place for full-datacenter failures, and for the occasional failure of a switch / database server / other resource that cannot easily be mirrored.
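&nbsp;
As a rough illustration of the load-balancer item (not a production design; the backend addresses and the /healthz path are made up), the front end health-checks its backends and forwards each request to the first live one:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    import urllib.request

    BACKENDS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]   # hypothetical

    def pick_backend():
        """Return the first backend that answers its health check, or None."""
        for base in BACKENDS:
            try:
                urllib.request.urlopen(base + "/healthz", timeout=1)
                return base
            except OSError:
                continue
        return None

    class Proxy(BaseHTTPRequestHandler):
        def do_GET(self):
            backend = pick_backend()
            if backend is None:
                self.send_error(502, "no healthy backend")
                return
            with urllib.request.urlopen(backend + self.path, timeout=5) as resp:
                body = resp.read()
                status = resp.status
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8000), Proxy).serve_forever()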

A very small minority of web sites use multi-datacenter setups, with 'geo-balancing' between datacenters.


DNS failover definitely works great. I have been using it for many years to shift traffic between datacenters manually, or automatically when monitoring systems detected outages, connectivity issues, or overloaded servers. When you see the speed at which it works, and the volume of real-world traffic that can be shifted with ease, you'll never look back. I use Zabbix for monitoring all of my systems, and the visual graphs that show what happens during a DNS failover put all my doubts to an end. There may be a few ISPs out there that ignore TTLs, and there are some users still on old browsers, but when you are looking at traffic from millions of page views a day across 2 datacenter locations and you do a DNS traffic shift, the residual traffic coming in that ignores TTLs is laughable. DNS failover is a solid technique.

DNS was not designed for failover, but it was designed with TTLs that work amazingly well for failover when combined with a solid monitoring system. TTLs can be set very short. I have effectively used TTLs of 5 seconds in production for lightning-fast DNS-failover-based solutions. You have to have DNS servers capable of handling the extra load, and named won't cut it. However, PowerDNS fits the bill when backed by MySQL-replicated databases on redundant name servers. You also need a solid distributed monitoring system that you can trust for the automated failover integration. Zabbix works for me: I can verify outages from multiple distributed Zabbix systems almost instantly, update the MySQL records used by PowerDNS on the fly, and provide nearly instant failover during outages and traffic spikes.
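&nbsp;
For what it's worth, the update step can be as simple as rewriting one row in the backend database. A rough sketch, assuming PowerDNS with the generic MySQL (gmysql) backend and its standard records table; the credentials, hostname and addresses are placeholders, and in practice the monitoring system would trigger this:

    import mysql.connector  # MySQL Connector/Python

    PRIMARY, STANDBY = "192.0.2.10", "198.51.100.10"   # example addresses

    conn = mysql.connector.connect(host="localhost", user="pdns",
                                   password="secret", database="pdns")
    cur = conn.cursor()
    # Point the hostname at the standby datacenter; PowerDNS serves the new
    # content on the next query, so with a 5-second TTL traffic moves fast.
    cur.execute("UPDATE records SET content = %s "
                "WHERE name = %s AND type = 'A' AND content = %s",
                (STANDBY, "www.example.com", PRIMARY))
    conn.commit()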

But hey - I built a company that provides DNS failover services after years of making it work for large companies. So take my opinion with a grain of salt. If you want to see some Zabbix traffic graphs of high-volume sites during an outage, to see for yourself exactly how well DNS failover works, email me; I'm more than happy to share.


The issue with DNS failover is that it is, in many cases, unreliable. Some ISPs will ignore your TTLs, the switch doesn't happen immediately even when they do respect them, and when your site comes back up, you can get some weirdness with sessions when a user's DNS cache times out and they end up heading over to the other server.

Unfortunately, it is pretty much the only option, unless you're large enough to do your own (external) routing.


The prevalent opinion is that with DNS RR, when an IP goes down, some clients will continue to use the broken IP for minutes. This was stated in some of the previous answers to the question, and it is also written on Wikipedia.

Anyway,

http://crypto.stanford.edu/dns/dns-rebinding.pdf explains that this is not true for most current browsers. They will try the next IP within seconds.
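&nbsp;
A small illustration of the client-side behaviour that paper describes: resolve every address for the name and try each one in turn, falling through to the next on failure (plain Python, nothing browser-specific):

    import socket

    def connect_any(host, port=80, timeout=3.0):
        """Try each address returned for host until one accepts a connection."""
        for *_, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError:
                continue   # dead address: fall through to the next A record
        raise OSError("no reachable address for " + host)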

http://www.tenereillo.com/GSLBPageOfShame.htm seems to be even stronger:

The use of multiple A records is not a trick of the trade, or a feature conceived by load balancing equipment vendors. The DNS protocol was designed with support for multiple A records for this very reason. Applications such as browsers and proxies and mail servers make use of that part of the DNS protocol.

Maybe some expert can comment and give a clearer explanation of why DNS RR is not good for high availability.

Thanks,

Valentino

PS: sorry for the broken link but, as a new user, I cannot post more than 1