Is round-robin DNS a possible solution for high availability?

Round-robin DNS is not a good substitute for a load balancer. The DNS server does not know whether a node is up, so it will keep handing out the IP of a node that is down. Some of your users will reach your service and some will not.

When the client makes the DNS query, the DNS server returns all of the IP addresses associated with that name. The "round robin" part is the DNS server rotating the order of that list on every query. However, it is up to the application to "walk" through the list until it finds an IP that works, and most applications don't do that.

Windows Telnet, oddly enough, is one application that is smart enough to walk the linked list of returned IPs. You can see this behavior yourself if you attempt to telnet to google.com, for example. You will notice that it takes a long time to finally fail. That is because google.com has a lot of IP addresses, and the telnet client tries every one of them.
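For illustration, here is roughly what "walking the list" looks like in an application. This is a minimal Python sketch, not taken from any particular client. The function names `resolve_all` and `connect_first_working` are made up for this example, and the 3-second timeout is an arbitrary assumption.

```python
import socket


def resolve_all(name, port):
    """Return every (ip, port) pair DNS reports for name, in the order received."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    return [info[4][:2] for info in infos]


def connect_first_working(addresses, timeout=3.0):
    """Try each (host, port) pair in order and return the first socket that connects."""
    last_error = None
    for host, port in addresses:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # This address is down or unreachable: remember why, then try the next one.
            last_error = exc
    raise OSError(f"no address accepted a connection: {last_error}")


# Hypothetical usage (needs network access):
#   sock = connect_first_working(resolve_all("example.com", 80))
```

Note that every dead address costs up to one full timeout before the next one is tried, which is why a client that walks a long list can take a long time to give up. In Python specifically, `socket.create_connection` already tries each resolved address when it is given a hostname, so the explicit loop here is only to make the behavior visible.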


Using a load balancer still leaves a single point of failure. If your load balancer goes offline, your website goes down.

Contrary to the above answer, most HTTP clients already do support trying each IP address returned from a DNS query until one gives a valid response. Please see here:

http://blog.engelke.com/2011/06/07/web-resilience-with-round-robin-dns/

It appears that the author tested the following browsers and found that they all fail over to a working IP address:

Chrome 11 on Windows 7
Firefox 4.0 on Windows 7
Internet Explorer 8 on Windows 7
Opera 11 on Windows 7
Safari 5 on Windows 7
Internet Explorer 7 on Windows XP (after noticeable delay)
Firefox 4.0 on Windows XP (after noticeable delay)
Android native browser on Android 2.3.3
iPhone native browser on iOS 4.3.3

Round-robin DNS won't give you all the features of a load-balancing server, such as monitoring response times from both servers and routing more traffic to one if the other is not responding as fast as it should. For resilience, though, I would say round-robin DNS is probably the better solution, as there is no longer a single point of failure.
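To make the response-time feature mentioned above concrete, here is a rough Python sketch of one way a load balancer could shift traffic toward the faster server. It is an assumption for illustration only, not how any specific load balancer works. The names `traffic_weights` and `pick_backend` and the inverse-latency weighting are all hypothetical.

```python
import random


def traffic_weights(avg_response_ms):
    """Give each backend a traffic share inversely proportional to its average response time."""
    inverse = {name: 1.0 / ms for name, ms in avg_response_ms.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def pick_backend(avg_response_ms, rng=random):
    """Pick a backend at random, favoring the ones that have been responding faster."""
    weights = traffic_weights(avg_response_ms)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

With measured averages of 100 ms and 300 ms, the faster server would receive about three quarters of the requests. Plain round-robin DNS has no equivalent, because the DNS server never sees how the backends are performing.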