Linux keeps retrying failed DNS server
Whenever one of the servers in /etc/resolv.conf is unreachable, Linux/glibc/whatever isn't smart enough to stop retrying it for a while. This makes a lot of services unavailable, because many of them (like SSH) do reverse lookups on all incoming connections, and each lookup hangs for the timeout of the first DNS server query.
How can I make my Ubuntu boxes be smart about the DNS servers they use? I could hack together a bash script that runs every minute and inserts a REJECT rule into iptables for the servers that don't respond to dig queries, but I'd rather not do it that way...
I'm told that Windows does this properly, BTW.
Edit: I worked around it a little bit by putting this in /etc/resolv.conf (or /etc/resolvconf/resolv.conf.d/base):
options timeout:2 rotate
Still not perfect, but more workable.
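To double-check what settings a resolv.conf like this actually carries, the file can be parsed in a few lines of Python. This is only a sketch: parse_resolv_conf is a hypothetical helper (not part of glibc or any library), the sample addresses are placeholders, and it only understands nameserver and options lines.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) from resolv.conf-style text."""
    nameservers = []
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword == "options":
            for opt in args:
                # "timeout:2" -> 2; bare flags such as "rotate" -> True
                name, sep, value = opt.partition(":")
                options[name] = int(value) if sep else True
    return nameservers, options


# Placeholder sample, not a real configuration.
sample = """\
nameserver 172.16.2.14
nameserver 172.16.2.18
options timeout:2 rotate
"""
servers, opts = parse_resolv_conf(sample)
print(servers)
print(opts)
```

A line written with a space instead of a colon ("options timeout 1") would show up here as two bare flags rather than a timeout value, which makes that kind of typo easy to spot.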
Solution 1:
Why are the DNS servers becoming unavailable? That's the issue we should focus on fixing...
You should omit the rotate directive if you want to have a deterministic retry order. rotate basically gives you round-robin lookups, which can have undesirable results in your situation.
My /etc/resolv.conf tends to look like:
search blah.net client.blah.net
options timeout:1
nameserver 172.16.2.14
nameserver 172.16.2.18
Short of that, you do have the option of using a caching DNS service on your local machine, or even enabling the Name Service Cache Daemon (nscd). That will help buffer the delays that come with unreliable DNS resolvers.
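To illustrate why a cache helps, here is a minimal in-process sketch of the idea. It is not how nscd or any particular caching resolver is implemented: TTLCache and fake_resolve are hypothetical names, the fixed 60-second lifetime is an assumption (real DNS caches honour per-record TTLs), and the address is a documentation placeholder.

```python
import time


class TTLCache:
    """Cache lookup results for a fixed number of seconds."""

    def __init__(self, resolve, ttl=60.0, clock=time.monotonic):
        self._resolve = resolve  # function doing the real (possibly slow) lookup
        self._ttl = ttl
        self._clock = clock
        self._entries = {}       # name -> (expiry time, result)

    def lookup(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: no resolver round trip
        result = self._resolve(name)
        self._entries[name] = (now + self._ttl, result)
        return result


calls = []


def fake_resolve(name):
    # Stand-in for a slow upstream query; records how often it is called.
    calls.append(name)
    return "192.0.2.10"


cache = TTLCache(fake_resolve, ttl=60.0)
first = cache.lookup("host.example")
second = cache.lookup("host.example")
print(first, second, len(calls))
```

Only the first lookup reaches the upstream resolver; repeats within the lifetime are answered locally, so a dead server is only felt on cache misses.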
Solution 2:
Ugh. I've come across this same problem in my systems. When the primary DNS server goes offline, the entire system becomes incredibly slow at best.
In fact, I asked a similar question quite some time ago: "DNS/resolv.conf settings for a Primary DNS Server failure?" There were some really good answers there that you might find useful.
I wound up just editing /etc/resolv.conf with lower timeout values (options timeout:1), largely because it was the easiest workaround rather than the most effective one. This change means the servers spend less time waiting for dead resolvers: lookups take 2 seconds rather than 10. This is still terrible if you're trying to do anything that isn't a batch job, but at least it resulted in very few service failures.
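One way to read those numbers, as a rough sketch: glibc's default timeout is 5 seconds, and if an operation triggers two lookups that each wait out the dead first server before the second one answers, the wait drops from 10 seconds to 2 with timeout:1. The two-lookups figure is my assumption, dead_primary_delay is a hypothetical helper, and the model ignores retries and the case where more than one server is down.

```python
def dead_primary_delay(timeout_s, lookups=1):
    """Seconds spent waiting when only the first nameserver is down.

    Simplified model: each lookup waits one full timeout on the dead
    first server, then the next server answers immediately.
    """
    return timeout_s * lookups


# Assumptions: two lookups per operation, 5-second default timeout.
before = dead_primary_delay(5, lookups=2)
after = dead_primary_delay(1, lookups=2)
print(before, after)
```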