How does DNS nameserver fallback work?

We have two DNS servers listed in our NS records. Last night, one of them went down. As expected, some DNS servers stopped resolving our hostnames. I assumed this would be temporary and that resolution would recover once the TTL of our NS records (1 hour) expired.

More than an hour later, I was still getting DNS timeouts from desktops that were using Earthlink, Verizon and OpenDNS servers. I tested whether the other DNS server was answering:

dig @ns2.example.com www.example.com +short

This worked.

My questions:

  1. Does anyone have an answer as to why other DNS servers were not hitting our other DNS server even after the TTL expired?
  2. Do DNS servers prefer a domain's main DNS server (from the SOA record)?
  3. Is there any algorithm used to pick a nameserver from the available NS records? I'm assuming this is implementation-specific, but perhaps there are some standards that apply here.

Solution 1:

This is an unfortunate irritation. Multiple DNS servers are supposed to increase reliability, but in practice they frequently have the reverse effect.

The problem is that the client only waits so long for a response, and the name server it asks waits about that same amount of time for your DNS server. Say you have two DNS servers, A and B, and the client is configured with two name servers, Z and Y. Say A is working and B has failed. This happens:

  1. Client connects to name server Z and asks it for the information. Z chooses B and sends a query.

  2. The client times out because name server Z did not respond.

  3. Client tries name server Y. Y chooses B and sends a query.

  4. Name server Z times out and tries A. It gets the right answer, but the client isn't waiting any more.

  5. The client times out because name server Y did not respond.

  6. The client gives up, having both its name servers fail to respond.

  7. Name server Y times out and tries A. It gets the right answer, but the client isn't waiting any more.

And there's no good solution. If the client waited longer for its name server to reply, that name server could in turn wait longer on each of your DNS servers, so the client would need to wait longer still. Arguably, the problem was that Y and Z didn't give up on B fast enough.
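The sequence above can be sketched as a small simulation. This is only an illustrative model, not how any particular resolver is implemented: the function names, the 5-second timeouts and the 0.05-second round trip are all assumptions, and it ignores client retries and resolver caching.

```python
def resolver_answer_time(first_choice, servers, up, resolver_timeout, rtt=0.05):
    """Seconds a name server (Z or Y) needs to get an answer.

    It queries first_choice, then the remaining servers in order,
    waiting resolver_timeout on each one that is down.
    Returns None if every server is down.
    """
    order = [first_choice] + [s for s in servers if s != first_choice]
    elapsed = 0.0
    for server in order:
        if server in up:
            return elapsed + rtt
        elapsed += resolver_timeout  # waited in vain for a dead server
    return None


def client_lookup(first_choices, servers, up, client_timeout, resolver_timeout):
    """True if any of the client's name servers answers in time.

    first_choices holds, per name server the client tries, the DNS
    server that name server happens to query first.
    """
    for first_choice in first_choices:
        t = resolver_answer_time(first_choice, servers, up, resolver_timeout)
        if t is not None and t <= client_timeout:
            return True
    return False  # the client gave up on all of its name servers


servers = ["A", "B"]
up = {"A"}  # B has failed

# Z and Y both pick B first and wait as long as the client does: lookup fails.
print(client_lookup(["B", "B"], servers, up, client_timeout=5.0, resolver_timeout=5.0))

# Same bad luck, but Z and Y give up on B after 2 seconds: lookup succeeds.
print(client_lookup(["B", "B"], servers, up, client_timeout=5.0, resolver_timeout=2.0))
```

In this model the first call prints False and the second prints True, which matches the point that Y and Z giving up on B sooner would have let the answer from A arrive in time.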

Essentially, if any of your name servers are out, some clients will, through sheer bad luck, time out because they tried only the bad ones.

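On the bright side, if you have two nameservers and one fails, about 75% of lookups will get an answer (in the scenario above, a lookup fails only when both of the client's name servers pick the dead server first), instead of 0%.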
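The 75% figure can be reproduced by enumerating the cases. This is a rough sketch that assumes each of the client's name servers picks uniformly and independently among your DNS servers, and that a lookup fails only when every one of them picks a dead server first. Real resolvers often track response times and favor servers that have been answering, so the real numbers will differ. The function name `success_fraction` is made up for this example.

```python
from fractions import Fraction
from itertools import product


def success_fraction(num_resolvers, servers=("A", "B"), dead=("B",)):
    """Fraction of lookups that succeed when each of the client's
    name servers picks one DNS server uniformly at random, and the
    lookup fails only if all of them picked a dead server."""
    outcomes = list(product(servers, repeat=num_resolvers))
    ok = [o for o in outcomes if any(choice not in dead for choice in o)]
    return Fraction(len(ok), len(outcomes))


print(success_fraction(2))  # 3/4: only the (B, B) case fails
```

Under the same assumptions, a client with a single name server would succeed half the time, and one with three name servers 7/8 of the time.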