Correct way to set up DNS primary/secondary/... for redundancy and latency reduction?

I thought DNS primary/secondary for redundancy purposes was straightforward. My understanding is that you should have a primary and at least one secondary, and that you should set up your secondary in a geographically different location, but also behind a different router (see for example https://serverfault.com/questions/48087/why-are-there-several-nameservers-for-my-domain)

Currently, we have two name servers both in our main data center. Recently, we've suffered some outages for various reasons that took out both name servers, and left us and our customers without working DNS for a few hours. I've asked my sysadmin team to finish setting up a DNS server in another data center and configure it as the secondary name server.

However, our sysadmins claim that this doesn't help much if the other data center is not at least as dependable as the primary one. They claim that most clients will still fail to resolve names, or will take too long to time out, when the primary data center is down.

Personally, I'm convinced we're not the only company with this kind of problem and that it most likely is already a solved problem. I can't imagine all those internet companies being affected by our kind of problem. However, I can't find good online docs that explain what happens in failure cases (for example, client timeouts) and how to work around them.

What arguments can I use to poke holes in our sysadmins' reasoning? Any online resources I can consult to better understand the problems they claim exist?

Some additional notes after reading the replies:

  • we're on Linux
  • we have additional complicated DNS needs; our DNS entries are managed by some custom software, with BIND currently slaving from a Twisted DNS implementation, and some views in the mix as well. However we're completely capable of setting up our own DNS servers at another data center.
  • I'm talking about authoritative DNS for outsiders to find our servers, not recursive DNS servers for our local clients.

There is a really great, albeit quite technical, "Best Practices" document that may prove useful when making your case to your sysadmins: http://www.cisco.com/web/about/security/intelligence/dns-bcp.html

If they don't recognize the validity of an article written by Cisco, then you might as well stop arguing with them and go up a level of management.

Many other "Best Practices" documents recommend separating your primary and secondary nameservers not only by IP block, but by physical location. In fact, RFC 2182 recommends that secondary DNS servers be geographically separated. For many companies, this means renting a server in another data center, or subscribing to a hosted DNS provider such as ZoneEdit or UltraDNS.
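In BIND, wiring up an off-site secondary is a small amount of configuration on each end. A hedged sketch, using the "master"/"slave" terminology of the BIND versions current at the time (zone name and all addresses are invented for illustration):

```
# named.conf on the primary, at the main data center -- illustrative only
zone "example.com" {
    type master;
    file "zones/example.com";
    allow-transfer { 198.51.100.2; };   # the off-site secondary's address
    also-notify    { 198.51.100.2; };   # push NOTIFY so it reloads promptly
};

# named.conf on the off-site secondary -- illustrative only
zone "example.com" {
    type slave;
    masters { 192.0.2.1; };             # the primary's address
    file "slaves/example.com";          # local copy of the transferred zone
};
```

Both servers then go into the zone's NS records and the registrar's delegation, so outside resolvers can reach either one.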


However, our sysadmins claim that this doesn't help much if the other data center is not at least as dependable as the primary one. They claim that most clients will still fail to resolve names, or will take too long to time out, when the primary data center is down.

Ah, so the focus is on dependability. It sounds like they are taking a jab at your link to the outside world rather than at setting up secondary DNS as such. All the same, do set up secondary DNS and proceed from there; it will help with the load and will prop things up in a pinch. But do inquire as to why they think the other location is not dependable.

Personally, I'm convinced we're not the only company with this kind of problem and that it most likely is already a solved problem. I can't imagine all those internet companies being affected by our kind of problem.

You're not the only company, and this has probably been rehashed a million times in companies the world over.

However, I can't find good online docs that explain what happens in failure cases (for example, client timeouts) and how to work around them.

What arguments can I use to poke holes in our sysadmins' reasoning? Any online resources I can consult to better understand the problems they claim exist?

  • I'm talking about authoritative DNS for outsiders to find our servers, not recursive DNS servers for our local clients.

You can do all kinds of things, including setting up an external DNS service that is registered as the authority for your zone, but secretly making the (outside) authoritative servers secondaries to your own (inside) DNS servers. This configuration is horrible, wrong, shows that I am truly an evil SysAdmin, and a kitten dies every time I recommend it. But it does two things:

  • You get your DNS service to handle the brunt of the load, rendering questions about the capacity of your own (internal) DNS moot.
  • You get your DNS service to stay up while your in-house DNS servers may be down, so it doesn't matter how dependable your link is - what matters is how dependable your DNS service provider is.

The reasons that this is the wrong thing to do:

  • You would be setting up what is called a "stealth nameserver": while it appears in your zone file, and will answer if you query it directly, it is never advertised to the outside world. Client queries will never reach it.
  • While your DNS would continue to operate fine (because the hosted service would keep answering), that doesn't mean any websites you host would still work if your internet connection was down. In other words, it only addresses half of the issue. It really does sound like there are other issues the admins are concerned about.
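On the in-house side, the hidden-primary arrangement described above is mostly just zone-transfer plumbing. A hedged named.conf sketch (the provider's transfer address is invented for illustration):

```
# named.conf on the in-house (hidden) primary -- illustrative only
zone "example.com" {
    type master;
    file "zones/example.com";
    allow-transfer { 203.0.113.53; };   # the hosted provider's transfer host
    also-notify    { 203.0.113.53; };   # NOTIFY so the provider reloads fast
    notify explicit;                    # only notify the listed address
};
```

The in-house server is simply left out of the zone's published NS records and the registrar's delegation; that omission is what makes it "stealth" — transfers go out, but no client queries come in.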

Unfortunately the Linux (glibc) stub resolver has no real support for detecting a dead nameserver and failing over. It keeps sending each request to the first nameserver listed, waits out the configured timeout, retries, and only then tries the next server.

This can mean delays of up to 30 seconds per request, because the secondary is never tried first as long as the primary is listed ahead of it, even when the primary is down.
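The stub resolver's behavior can at least be tuned in /etc/resolv.conf. The glibc defaults are timeout:5 and attempts:2, which is where multi-second stalls per lookup come from when the first server is dead. A sketch with illustrative addresses:

```
# /etc/resolv.conf -- addresses are examples only
nameserver 192.0.2.10                 # primary resolver
nameserver 198.51.100.10              # secondary at another site
options timeout:2 attempts:2 rotate   # shorter timeout; rotate round-robins
```

The rotate option spreads queries across both servers, which also means roughly half of all lookups already start at the server that is still up during an outage.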

I wanted to solve this because our Amazon EC2 resolving nameserver is unreachable for many of our workers. This causes big delays in our processes, and even downtime in some cases, because we rely on name resolution. I wanted a good failover to the Google / Level 3 nameservers in case Amazon's went down again, and a fast fail-back, because Amazon's resolver returns local addresses where applicable, resulting in lower latency for instance-to-instance communication.

But whatever the use case, there's a need for better failover, and I wanted to solve it. I wanted to stay away from proxying daemons, services, and the like, as those would just introduce more single points of failure. I wanted to use as archaic and robust a technology as I could.

I decided to use crontab & bash, and wrote nsfailover.sh. Hope this helps.
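The idea can be sketched in a few lines of POSIX shell. This is an assumption of how such a script works, not the author's actual nsfailover.sh; the resolver addresses and probe hostname are invented, and the output path is a demo file rather than /etc/resolv.conf:

```shell
#!/bin/sh
# Cron-driven resolver failover sketch (hypothetical, not nsfailover.sh).
# Probe the preferred resolver; if it answers, use it alone, otherwise
# fall back to public resolvers. Run from crontab every minute.

PREFERRED="172.16.0.23"            # hypothetical EC2-internal resolver
FALLBACKS="8.8.8.8 4.2.2.2"        # Google / Level 3 public resolvers
RESOLV_CONF="./resolv.conf.demo"   # point at /etc/resolv.conf in real use

# Return success if the given server answers one query within 2 seconds.
probe() {
    dig +time=2 +tries=1 @"$1" example.com A >/dev/null 2>&1
}

if probe "$PREFERRED"; then
    servers="$PREFERRED"           # primary is healthy: use it alone
else
    servers="$FALLBACKS"           # primary is down: fail over
fi

# Rewrite the resolver configuration with the chosen server list.
: > "$RESOLV_CONF"
for s in $servers; do
    echo "nameserver $s" >> "$RESOLV_CONF"
done
```

Because the script reruns from cron, it fails back to the preferred resolver automatically on the first probe that succeeds after recovery.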


It sounds like the problem is that clients—which could be anyone, anywhere—see two DNS servers and if one fails, they either do not failover to the secondary server or there is a long timeout before they do.

I agree that the primary and secondary DNS servers should be located at different facilities as a best practice, but I don’t see how that would fix this particular problem.

If the client is going to insist on querying a specific IP address, ignoring the secondary's IP address (or taking a long time before timing out and trying it), then you simply have to come up with a solution that keeps that IP address working even when the primary server is down.

Some directions to explore would be a load balancer that can redirect traffic for a single IP address to multiple servers at different data centers; or perhaps anycast routing.