Linux keeps retrying a failed DNS server

Whenever one of the servers in /etc/resolv.conf is unreachable, the resolver (glibc, presumably) isn't smart enough to stop retrying it for a while. This makes a lot of services unavailable, because many of them (SSH, for example) do reverse lookups on all incoming connections, and each lookup hangs until the query to the first DNS server times out.

How can I make my Ubuntu boxes smarter about the DNS servers they use? I could hack together a bash script that runs every minute and inserts a REJECT rule into iptables for any server that doesn't respond to dig queries, but I'd rather not do it that way...
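
For what it's worth, the detection half of that idea needs neither dig nor iptables. Below is a minimal sketch in Python, assuming IPv4 nameservers that answer plain UDP on port 53; the function names and the example.com probe name are made up for illustration, and it only reports dead servers rather than blocking them.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = A (1), QCLASS = IN (1).
    return header + qname + struct.pack("!HH", 1, 1)


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def server_responds(server, name="example.com", timeout=1.0):
    """Return True if the server sends back a reply with a matching ID in time.

    IPv4 only in this sketch; any socket error or timeout counts as "dead".
    """
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # socket.timeout is a subclass of OSError
            return False
    return reply[:2] == query[:2]


def unresponsive(resolv_conf_text, timeout=1.0):
    """Return the listed nameservers that did not answer the probe."""
    return [
        server
        for server in nameservers(resolv_conf_text)
        if not server_responds(server, timeout=timeout)
    ]
```

Feeding the contents of /etc/resolv.conf to unresponsive() from a cron job would give the list of servers to act on. What to do with that list (rewriting resolv.conf, alerting, and so on) is left open here.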

I'm told that Windows does this properly, BTW.

Edit: I worked around it a little bit by putting this in /etc/resolv.conf (or /etc/resolvconf/resolv.conf.d/base):

options timeout:2 rotate

Still not perfect, but more workable.
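
To make explicit what that options line changes, here is a small sketch that reads the options from resolv.conf-style text and fills in the defaults. It assumes the glibc defaults and caps described in the resolv.conf(5) man page (timeout 5 seconds capped at 30, attempts 2 capped at 5); parse_options is a made-up helper, not part of any library.

```python
# glibc defaults and upper limits, as documented in resolv.conf(5).
DEFAULTS = {"timeout": 5, "attempts": 2, "rotate": False}
CAPS = {"timeout": 30, "attempts": 5}


def parse_options(resolv_conf_text):
    """Return the effective timeout/attempts/rotate settings for the given text."""
    opts = dict(DEFAULTS)
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "options":
            continue
        for token in fields[1:]:
            key, sep, value = token.partition(":")
            if key == "rotate":
                opts["rotate"] = True
            elif key in CAPS and sep and value.isdigit():
                # Numeric options need the "name:value" form; this sketch
                # skips anything else and keeps the default.
                opts[key] = min(int(value), CAPS[key])
    return opts


print(parse_options("options timeout:2 rotate"))
# {'timeout': 2, 'attempts': 2, 'rotate': True}
```

So the line above shortens the per-server wait from 5 seconds to 2 and turns on rotation, while the number of attempts stays at its default.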

Solution 1:

Why are the DNS servers becoming unavailable? That's the issue we should focus on fixing...

You should omit the rotate option if you want a deterministic retry order. rotate basically gives you round-robin selection among the listed nameservers, so which server is tried first changes from lookup to lookup, which can have undesirable results in your situation.

My /etc/resolv.conf tends to look like:

search blah.net client.blah.net
options timeout:1
nameserver 172.16.2.14
nameserver 172.16.2.18

Short of that, you have the option of running a caching DNS service on your local machine, or even enabling the Name Service Cache Daemon (nscd). That will help buffer the delays that come with unreliable DNS resolvers.
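
As a rough illustration of why a cache helps, here is a toy sketch of the idea, not how nscd or any real caching resolver is implemented. CachingResolver and its fixed ttl are made up for the example; a real cache would honor the TTL carried by each DNS record.

```python
import time


class CachingResolver:
    """Wrap a slow lookup function and reuse its answers for `ttl` seconds."""

    def __init__(self, resolve, ttl=300.0, clock=time.monotonic):
        self._resolve = resolve  # the slow lookup, e.g. socket.gethostbyname
        self._ttl = ttl          # fixed lifetime for every cached answer
        self._clock = clock
        self._cache = {}         # name -> (answer, expiry time)

    def lookup(self, name):
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]        # served from cache, no resolver delay
        answer = self._resolve(name)  # may stall on a dead nameserver
        self._cache[name] = (answer, now + self._ttl)
        return answer
```

Only the first lookup of a name (and the first one after each expiry) pays the timeout penalty; repeat lookups in between return immediately.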

Solution 2:

Ugh. I've come across this same problem in my systems. When the primary DNS server goes offline, the entire system becomes incredibly slow at best.

In fact, I asked a similar question quite some time ago: "DNS/resolv.conf settings for a Primary DNS Server failure?" There were some really good answers there that you might find useful.

I wound up just editing /etc/resolv.conf with lower timeout values (options timeout:1), largely because it was the easiest workaround rather than the most effective one. This change means the servers spend less time waiting for dead resolvers: lookups take 2 seconds rather than 10. This is still terrible if you're trying to do anything that isn't a batch job, but at least it resulted in very few service failures.
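
One way to arrive at those figures is the back-of-the-envelope model below. It assumes the glibc default timeout of 5 seconds, a single dead server listed ahead of a working one, and two queries per lookup that each wait out one timeout on the dead server. That last assumption is a guess made to fit the numbers, not something measured, and resolv.conf(5) itself warns that one resolver call does not necessarily map to one timeout.

```python
def stall_seconds(timeout, dead_before_live=1, queries_per_lookup=2):
    """Estimate the time spent waiting on dead servers before a live one answers.

    Simplified model: every query waits one full timeout on each dead server
    listed ahead of the first working one. queries_per_lookup=2 is an
    assumption chosen to match the figures quoted above.
    """
    return timeout * dead_before_live * queries_per_lookup


print(stall_seconds(5))  # default timeout -> 10
print(stall_seconds(1))  # options timeout:1 -> 2
```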
