Avoiding DNS timeouts when a DNS server fails

We have a small datacenter with about a hundred hosts pointing to 3 internal DNS servers (BIND 9). Our problem comes when one of the internal DNS servers becomes unavailable. At that point, all the clients that list that server first start performing very slowly.

The problem seems to be that the stock Linux resolver has no real concept of "failing over" to a different DNS server. You can adjust the timeout and the number of retries it uses (and set rotate so it works through the list), but whatever settings we use, our services run much more slowly when a primary DNS server becomes unavailable. At the moment this is one of our largest sources of service disruption.
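To make the slowdown concrete, here is a rough sketch of the cost under a deliberately simplified model. It assumes the resolver tries servers in list order and pays one full timeout for each dead server it tries before reaching a live one, and that rotate spreads the starting server evenly. The real glibc resolver is more complicated than this, and the function name and parameters are made up for illustration.

```python
def added_delay_per_lookup(servers_up, timeout=5.0, rotate=False):
    """Expected extra seconds per lookup in a simplified resolver model.

    servers_up: list of booleans in resolv.conf order (True = answering).
    timeout:    seconds waited on a dead server before moving on.
    rotate:     if True, assume every server is equally likely to be tried first.
    """
    if not any(servers_up):
        raise ValueError("model assumes at least one server is up")
    n = len(servers_up)
    starts = range(n) if rotate else [0]
    total = 0.0
    for start in starts:
        waited = 0.0
        for i in range(n):
            if servers_up[(start + i) % n]:
                break  # first live server answers, lookup completes
            waited += timeout  # dead server: wait out the timeout
        total += waited
    return total / len(starts)


if __name__ == "__main__":
    primary_down = [False, True, True]
    print(added_delay_per_lookup(primary_down))
    print(added_delay_per_lookup(primary_down, timeout=1.0, rotate=True))
```

Under this model, tuning only shrinks the penalty: every lookup that reaches the dead server first still waits out at least one timeout.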

My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...", but if that's an option I haven't seen it.

I was wondering how other folks handled this issue?

I can see 3 possible types of solutions:

Use linux-ha/Pacemaker and failover IPs (so the DNS VIPs are "always" available). Alas, we don't have a good fencing infrastructure, and without fencing Pacemaker doesn't work very well (in my experience, Pacemaker without fencing lowers availability).
Run a local DNS server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage.
Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd seems to have the right feature set: it marks DNS servers as up or down, and won't use 'down' DNS servers.
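The "mark servers up or down" behaviour in the last option could look roughly like the sketch below. This is not dnrd's actual implementation, only an illustration of the idea; the class name, the retry_after parameter, and the 192.0.2.x addresses are all hypothetical.

```python
import time


class ServerPool:
    """Track which upstream DNS servers are currently considered usable."""

    def __init__(self, servers, retry_after=30.0, clock=time.monotonic):
        self.servers = list(servers)
        self.retry_after = retry_after  # seconds before a down server is retried
        self.clock = clock              # injectable so the logic can be tested
        self.down_since = {}            # server -> time it was marked down

    def mark_down(self, server):
        self.down_since[server] = self.clock()

    def mark_up(self, server):
        self.down_since.pop(server, None)

    def candidates(self):
        """Servers to query, in order: up, or down long enough to re-probe."""
        now = self.clock()
        live = [
            s for s in self.servers
            if s not in self.down_since
            or now - self.down_since[s] >= self.retry_after
        ]
        # If everything is marked down, fall back to the full list
        # rather than refusing to resolve at all.
        return live or list(self.servers)


if __name__ == "__main__":
    pool = ServerPool(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
    pool.mark_down("192.0.2.1")  # e.g. after a query to it timed out
    print(pool.candidates())
```

The point of the design is that only the first query after a failure pays the timeout; later queries skip the dead server until it is due for a re-probe.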

Anycast seems to work only at the IP routing level, and depends on route updates to handle server failure. Multicast seemed like it would be a perfect answer, but BIND does not support broadcast or multicast, and the docs I could find suggest that multicast DNS is aimed more at service discovery and auto-configuration than at regular DNS resolution.

Am I missing an obvious solution?

A couple of options. Both will distribute the DNS load across your DNS servers.

Try using options rotate in resolv.conf. This spreads queries across the listed servers, which minimizes the impact of the primary server being down. The trade-off is that if any one of the other servers is down, some lookups will slow down.
Use a different nameserver order on different clients. This will allow some clients to run normally if the primary DNS server is down, and spreads the impact of an out-of-service DNS server around.

These options can be combined with options timeout:1 attempts:5. If you decrease the timeout, increase the attempts so that lookups that depend on slow external servers still get a chance to complete.
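If you push resolv.conf out with a script or config management, the two options plus the timeout settings can be combined along these lines. This is only a sketch: the function name, the hostnames, and the 192.0.2.x addresses are made up, and hashing the hostname is just one way to give each client a stable but different starting server.

```python
import zlib


def render_resolv_conf(hostname, nameservers, timeout=1, attempts=5, rotate=True):
    """Return resolv.conf text with a per-host nameserver order."""
    servers = list(nameservers)
    # Stable per-host offset, so different clients list a different server first.
    offset = zlib.crc32(hostname.encode("utf-8")) % len(servers)
    ordered = servers[offset:] + servers[:offset]

    lines = [f"nameserver {ip}" for ip in ordered]
    opts = [f"timeout:{timeout}", f"attempts:{attempts}"]
    if rotate:
        opts.append("rotate")
    lines.append("options " + " ".join(opts))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    dns_servers = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # placeholder addresses
    for host in ("web01", "web02", "db01"):                # hypothetical hosts
        print(f"# {host}")
        print(render_resolv_conf(host, dns_servers))
```

With a hundred clients the starting server should end up spread roughly evenly across the three, so losing any one DNS server affects about a third of first attempts instead of all of them.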

Depending on your router configuration you may be able to configure your DNS servers to take over the primary DNS server's IP address when it is down. This can be combined with the above techniques.

NOTE: I have run for years without unscheduled DNS outages. As others have noted, I would work on solving the issues that cause the DNS servers to fail. The steps above also help with misconfigured DNS servers that specify unreachable name servers.

Check out "man resolv.conf". You can add a timeout option to resolv.conf. The default is 5 seconds, but adding the following to resolv.conf should bring it down to 1 second: