I would like to implement Round Robin DNS but not for load balancing

I have a service that runs on a server in a big colocation facility. This server is where other servers report in to say whether they're up or down. Very basic stuff. The client agent on each remote server has 1 entry to point to - one address to report to - and the software has no fault tolerance.

What I would like to do is implement Round-Robin DNS to deal with the main internet connection for the monitoring server at the colo. This system has a large connection, but should it go offline I get a bunch of false alerts that the agent servers are offline - when in fact they are not - it's just that the colo line is down, or the firewall for that line is down.

If I put 2 entries in DNS - the first being the big-bandwidth line and main firewall, the second being the lower-bandwidth line and the smaller firewall - will these tiny "I am online/offline" packets from the agents work better? I know this is not optimal, but the software has no provision for the agents to try 2 separate entries. The reporting server itself isn't going offline - it's rock solid (dual SANs and 3 VMware servers, fully redundant)... But I have a single point of failure in the firewall and the main line. I just want to make this a little better should that line or firewall fail.
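
Roughly what I mean by the 2 entries - just an illustration with a made-up name and addresses, not my real zone:

    ; two A records for the same name, one per line/firewall
    monitor.example.com.  300  IN  A  203.0.113.10   ; main line, big firewall
    monitor.example.com.  300  IN  A  198.51.100.10  ; backup line, small firewall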

Thoughts?


If I put 2 entries in DNS - the first being the big-bandwidth line and main firewall, the second being the lower-bandwidth line and the smaller firewall - will these tiny "I am online/offline" packets from the agents work better?

No. This is what will happen if your main firewall is down:

  1. Client system does DNS query and gets entry #1 which points to the main firewall.
  2. Your client now has an IP address. DNS's responsibility is done.
  3. Client tries to access IP address, but that address has no connectivity.
  4. Tears of anguish.
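
Here's a rough sketch of those steps in code - assuming a hypothetical round-robin name monitor.example.com and port 8443, not your vendor's actual agent - just to make it obvious that DNS hands back an address and then steps out of the picture:

    # Minimal sketch, not real agent code. Name, port and behaviour are assumptions.
    import socket

    HOST, PORT = "monitor.example.com", 8443

    # Steps 1-2: the resolver hands back the A records; round robin only rotates
    # their order - it knows nothing about which firewall is actually alive.
    records = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)
    ip, port = records[0][4][:2]      # a naive client just takes the first answer

    # Step 3: try to report in; if that address sits behind the dead firewall,
    # this simply times out. DNS never steps back in to offer the other entry.
    try:
        with socket.create_connection((ip, port), timeout=5):
            print("reported in via", ip)
    except OSError as err:
        print("step 4, tears of anguish:", err)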

DNS is a simple key-value store and has no knowledge of anything beyond that. Your systems will still fail if they use round robin. To be fair, they'll only fail part of the time, which implies succeeding the rest of the time. With one of the two firewalls down and two round-robin entries, at any given point roughly half of the DNS queries will hand out the IP address of the functioning firewall / internet connection at your colocation space, and the other half will hand out the dead one - which half any individual agent lands in is very unpredictable. So, looking on the bright side, I guess that's better than nothing?

The real solution to the problem is either to make the connections more reliable through better providers, SLAs, and hardware, or to use link bonding of some kind. Use something like an Elfiq load balancer to manage the bonding. Of course, that introduces a new single point of failure. Then you can double up the Elfiqs in an active/passive cluster. Then you notice that they're both on the same power circuit, so you get a separate power drop to your cabinet. Then you notice that the two circuits are on the same grid...

...and then you realize that there is never going to be a time when no SPOF exists, so you simply have to transfer that SPOF to another person so you can blame them, or to a device / system that is sufficiently amazing to let you sleep at night. Until you realize that your developers don't sanity-check the application's inputs.


What you want isn't DNS round robin; it's redundant connectivity (something your colo provider should already be giving you, at least at their edge - if they don't have multiple redundant uplinks and proper routing configured to fail over when one link goes away, find a new colocation facility).

If you have a single firewall / network uplink and that single point of failure is unacceptable to you, it's time to invest in redundant firewalls and redundant links to your ISP's core (preferably through different access switches). Pretty much any commercial firewall worthy of the name can do this. You can even do it with free firewalls if you're on a budget.
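
For example, a pair of free Linux-based firewalls can share one outside address with VRRP via keepalived, so the agents keep their single entry and the standby box picks the address up if the primary dies. A minimal sketch only - the interface, router ID and address below are made up, not taken from your setup:

    # /etc/keepalived/keepalived.conf on the primary firewall (sketch)
    vrrp_instance WAN_GATEWAY {
        state MASTER              # the standby box uses BACKUP
        interface eth0            # outside interface
        virtual_router_id 51
        priority 150              # standby uses a lower priority, e.g. 100
        advert_int 1
        virtual_ipaddress {
            203.0.113.10/24       # the one address the agents ever point at
        }
    }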