A whole data center would need to go down or become unreachable for this to apply. Your backup at another data center would then be reached by routing the same IP addresses to that data center: once the primary data center stops making its BGP route announcements, the announcements from the secondary data center take over.

Smaller businesses are generally not large enough to justify the expense of portable IP address allocations and their own autonomous system number to announce BGP routes with. In that case, a provider with multiple locations is the way to go.

You either have to be reachable via your original IP addresses, or via a change of IP address published through DNS. Since DNS is not designed for what "failover" really requires (users can be cut off for at least as long as your TTL, or whatever TTL some caching servers impose), failing over to the backup site with the same IPs is the better solution.
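To see what failover window your users actually face, you can check the TTL your records are being handed out with. A minimal sketch, assuming the third-party dnspython library is installed ("example.com" is just a placeholder for your own name):

    # Print the A records and the TTL that caches will honour for a name.
    # Requires dnspython (pip install dnspython); older releases use
    # dns.resolver.query instead of dns.resolver.resolve.
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "A")
    print("Addresses:", [rr.address for rr in answer])
    print("TTL (seconds):", answer.rrset.ttl)

Whatever number comes back is roughly the minimum time a client with a cached answer can keep trying the dead site after you change the record.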


This started off as a comment...but it's getting too long.

Sadly most of the answers to the previous question are wrong: they assume that the failover has something to do with the TTL. The top voted answer is SPECTACULARLY wrong, and notably cites no sources. The TTL applies to the zone record as a whole and has nothing to do with Round Robin.

From RFC 1794 (which is all about Round Robin DNS serving):

There is no use in handing out information with TTLs of an hour [or less]

(IME it's nearer to 3 hours before you get full propagation).

From RFC 1035:

When several RRs of the same type are available for a
 particular owner name, the resolver should either cache them
 all or none at all

RFC 1034 sets out the requirements for negative caching - a method for indicating that all requests must be served fresh from the authoritative DNS server (in which case the TTL does control failover) - but in my experience support for this varies.

Since any failover would have to be implemented high in the client stack, it's arguably not part of TCP/IP or DNS - indeed, SIP, SMTP, RADIUS and other protocols running on top of TCP/IP define how the client should work with Round Robin - RFC 2616 (HTTP/1.1) is remarkable in not mentioning how it should behave.

However, in my experience, every browser and most other HTTP clients written in the last 10 years will transparently check additional A RRs if the connection appears to be taking longer than expected. And it's not just me:

  • http://www.nber.org/sys-admin/dns-failover.html
  • http://blog.engelke.com/tag/client-retry/
  • http://support.rightscale.com/12-Guides/Designers_Guide/Cloud_Solution_Architectures/Designing_and_Deploying_High-Availability_Websites
  • http://www-archive.mozilla.org/docs/netlib/dns.html

Failover times vary by implementation but are in the region of seconds. It's not an ideal solution since (due to the limits of DNS) removing a failed node from the published records takes the DNS TTL to propagate - in the meantime you have to rely on client-side detection.
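To make that client-side detection concrete, here is a rough sketch (not any browser's actual code) of the behaviour: resolve every A record for a name and try each address in turn with a short connect timeout, so a dead node costs seconds rather than a TTL. The hostname, port and timeout below are illustrative values.

    # Try every address behind a round-robin name until one answers.
    import socket

    def connect_any(host, port=80, timeout=3.0):
        last_error = None
        for family, socktype, proto, _name, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)   # first address that answers wins
                return sock
            except OSError as exc:       # timed out / refused: try the next RR
                last_error = exc
                sock.close()
        raise last_error or OSError("no addresses returned for %s" % host)

    sock = connect_any("www.example.com")
    print("connected to", sock.getpeername())
    sock.close()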

Round-Robin is not a substitute for other HA mechanisms within a site. But it does complement them (the guys who wrote HAProxy recommend using a pair of installations accessed via round robin DNS). It is the best supported mechanism for implementing HA across multiple sites: indeed, as far as I can determine, it is the only supported mechanism for failover available on standard clients.


The simplest approach to dual-DC redundancy would be an L2 MPLS VPN between the two sites, along with maintaining BGP sessions between them.

You can then essentially just have a physical IP per server and a virtual IP that floats between the two (HSRP/VRRP/CARP, etc.). Your DNS would point at this virtual IP, and traffic would be directed accordingly.
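Purely as an illustration of that layout, the sketch below probes the floating virtual IP and the two physical server addresses to see which ones currently answer on a service port; all of the addresses are documentation-range placeholders, not real values.

    # Crude reachability check for a VRRP/HSRP/CARP floating-IP setup.
    import socket

    VIRTUAL_IP = "203.0.113.10"           # the floating address DNS points at
    PHYSICAL_IPS = ["198.51.100.10",      # server in DC A
                    "198.51.100.20"]      # server in DC B

    def is_up(ip, port=80, timeout=2.0):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    for ip in [VIRTUAL_IP] + PHYSICAL_IPS:
        print(ip, "reachable" if is_up(ip) else "unreachable")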

The next consideration would be split brain - but that's another question for another time.

Juniper wrote a good white paper on dual-DC management with MPLS; you can grab the PDF here: http://www.juniper.net/us/en/local/pdf/whitepapers/2000407-en.pdf