DNS/resolv.conf settings for a Primary DNS Server failure?

I'm currently the administrator of some RHEL machines in a mixed network. Our DNS servers are Windows AD controllers, so they occasionally need to come down for maintenance (e.g., patching). This means that at some point, the primary DNS server for my Linux machines will be unreachable.

In the Windows world, this is handled pretty well. When DNS queries to the primary fail, Windows clients stop using it for 15 minutes. So, barring the initial hiccup, they all putt along pretty smoothly. But the Linux resolver keeps trying the same (failed) primary server on every lookup. By default it waits 5 seconds for the primary before trying a secondary server. This translates into EVERYTHING taking a long time, and even applications timing out if they do a good number of DNS lookups.

So, I'm looking into making my servers more robust. My current plan is to A) modify resolv.conf to wait only half a second for a response and not retry, and B) possibly make some strategic entries in /etc/hosts so that major servers are still reachable quickly.
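
As a rough sketch of plan A, the resolver's timeout and attempts options could be set as below. The two addresses are placeholders, not values from this network. Note that timeout takes whole seconds, so 1 is the smallest usable value and half a second cannot be expressed here.

```
# /etc/resolv.conf (sketch; nameserver addresses are placeholders)
nameserver 192.0.2.10
nameserver 192.0.2.11
# Wait 1 second per server instead of the default 5,
# and go through the server list only once.
options timeout:1 attempts:1
```

With these settings a lookup during an outage of the first server should fail over after about one second rather than five, but every lookup still pays that one second.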

All that being said, I'd love to have a better solution. Alternatively, I'd like to hear what other people are doing with their setups. Or just theoretical "Your idea is good/bad, here's why."


--Christopher Karel


Solution 1:

You might look at using dnsmasq as a local caching forwarder instead of relying solely on the resolver library. dnsmasq can query the upstream servers in parallel rather than serially, so having one drop out shouldn't cause so many problems.
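
A minimal sketch of such a setup, assuming dnsmasq runs on each Linux machine and that the two upstream addresses (placeholders here) are the AD DNS servers:

```
# /etc/dnsmasq.conf (sketch; upstream addresses are placeholders)
# Answer queries from this machine only.
listen-address=127.0.0.1
# Do not read upstream servers from /etc/resolv.conf; use the list below.
no-resolv
server=192.0.2.10
server=192.0.2.11
# Send each query to every upstream and use whichever answers first.
all-servers
# Keep answers in a local cache.
cache-size=1000
```

/etc/resolv.conf would then contain only "nameserver 127.0.0.1". The all-servers line matters: without it dnsmasq sends each query to a single upstream and tries to favour servers it believes are up.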

Solution 2:

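Running nscd to cache lookups and adding

options rotate

to /etc/resolv.conf may already do the trick. With rotate, the resolver spreads queries across the listed nameservers instead of always starting with the first one, and nscd answers repeated lookups from its cache.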
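
For the nscd side, the hosts cache is controlled in /etc/nscd.conf. The fragment below is only a sketch, and the time-to-live values are illustrative assumptions rather than recommendations:

```
# /etc/nscd.conf (sketch; TTL values are illustrative)
# Cache host lookups.
enable-cache            hosts   yes
# Seconds to keep successful lookups.
positive-time-to-live   hosts   3600
# Seconds to keep failed lookups.
negative-time-to-live   hosts   20
```

Keeping the negative TTL short limits how long a failure cached during the outage keeps being served once the DNS server is back.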

Solution 3:

An easier solution is to redirect the traffic for the duration of the maintenance window.

If you have a spare machine that can serve DNS, you could temporarily give it the IP of your primary server. Otherwise, you could deploy the redirection on the router: any packet whose destination is your primary server gets redirected to your secondary server.