Availability at risk due to one offline Domain Name Server?

Lastly, why would clients not query the next name server in the list by default when one is down?

That is exactly what recursive servers do when talking to authoritative servers. RFC 1035 §7.2 describes the overall process if you're interested, but the following excerpts are the most immediately relevant:

The key algorithm uses the state information of the request to select the next name server address to query, and also computes a timeout which will cause the next action should a response not arrive. The next action will usually be a transmission to some other server, but may be a temporary error to the client.

[snip]

  • If a resolver gets a server error or other bizarre response from a name server, it should remove it from SLIST, and may wish to schedule an immediate transmission to the next candidate server address.

A few other factors are considered when selecting the authoritative server, such as the observed response time from prior communication; the details are in the RFC.
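As a rough sketch of what §7.2 describes (illustrative only; real resolvers such as BIND or Unbound are far more elaborate, and the names here are invented for the example), the selection loop might look like:

```python
def resolve(slist, send_query, per_server_timeout=2.0, max_attempts=6):
    """Illustrative sketch of RFC 1035 §7.2 server selection.

    slist: candidate authoritative server addresses (the RFC's SLIST).
    send_query: callable(server, timeout) returning a response, or None
                on timeout, or raising OSError on a server error.
    """
    candidates = list(slist)
    for attempt in range(max_attempts):
        if not candidates:
            break  # every candidate failed: temporary error to the client
        server = candidates[0]  # real resolvers sort by observed RTT
        try:
            response = send_query(server, per_server_timeout)
        except OSError:
            # Server error or bizarre response: remove it from SLIST and
            # immediately move on to the next candidate, per the RFC.
            candidates.remove(server)
            continue
        if response is None:
            # Timeout: demote this server and try another one.
            candidates.append(candidates.pop(0))
            continue
        return response
    raise TimeoutError("no authoritative server answered")
```

With two servers where the first is unresponsive, the loop simply pays one per-server timeout and then gets its answer from the second.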

The key to ensuring that you are not impacted by nameserver unreachability is covered by BCP 16. In particular, Section 3.1 states:

Secondary servers must be placed at both topologically and geographically dispersed locations on the Internet, to minimise the likelihood of a single failure disabling all of them.

That is, secondary servers should be at geographically distant locations, so it is unlikely that events like power loss, etc, will disrupt all of them simultaneously. They should also be connected to the net via quite diverse paths. This means that the failure of any one link, or of routing within some segment of the network (such as a service provider) will not make all of the servers unreachable.

This is to account for the fact that the resiliency of your domain is severely impacted by single points of failure on the network, or on the physical site. The ideal state is to have multiple authoritative nameservers that are not impacted by any change in network or physical state experienced by the others.
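To make the dispersal idea concrete (a naive sketch, not a real topology check — shared address prefixes are only a weak proxy for shared links or sites, and the hostnames and addresses below are invented example data):

```python
import ipaddress

def diversity_report(ns_addrs, prefix_len=24):
    """Group nameserver IPv4 addresses by their /prefix_len network.

    Servers sharing a network are likely to share a link, site, or
    provider, so one failure could take all of them offline at once.
    """
    groups = {}
    for name, addr in ns_addrs:
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups.setdefault(net, []).append(name)
    # Report only the suspicious groups: more than one server per network.
    return {str(net): names for net, names in groups.items() if len(names) > 1}

# Hypothetical example data: ns1 and ns2 sit in the same /24.
report = diversity_report([
    ("ns1.example.com", "192.0.2.10"),
    ("ns2.example.com", "192.0.2.20"),
    ("ns3.example.com", "198.51.100.5"),
])
```

Here `report` flags ns1 and ns2 as co-located, which is exactly the single point of failure BCP 16 warns against.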


I would say that the answer to the overall sentiment of the question is "no".

First off, the client machine traditionally only has a stub resolver, blindly sending all queries (with "recursion desired" set) to some configured nameserver address (resolv.conf).
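Concretely, all a stub resolver does is build a query with the RD (recursion desired) bit set and ship it off. A hand-rolled sketch of the RFC 1035 wire format, for illustration only:

```python
import struct

def build_query(qname, qtype=1, qclass=1):
    """Build a minimal DNS query with the RD (recursion desired) bit set."""
    header = struct.pack(
        ">HHHHHH",
        0x1234,   # transaction ID (would normally be random)
        0x0100,   # flags: QR=0 (query), RD=1 -- "please recurse for me"
        1,        # QDCOUNT: one question
        0, 0, 0,  # no answer/authority/additional records
    )
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00" + struct.pack(">HH", qtype, qclass)
    return header + question

# A stub resolver would send this over port 53 to an address listed
# in resolv.conf and blindly wait for that server's answer.
query = build_query("example.com")
```

Note there is no iteration logic here at all — following referrals and retrying other servers is entirely the recursive server's job.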

It's really the next step, when that nameserver processes the recursive request by making iterative queries until it reaches the authority, that your question applies to.

And while there is some degree of implementation-specific behavior, it is absolutely expected to work its way through the authoritative nameservers until it finds one that is responsive.
The caveat is that there will be some overall timeout, so there is a risk that the lookup cannot finish in time.
That said, it's also common to keep tabs on which servers are working and which aren't, increasing the chances that successive queries will succeed in a timely fashion, and of course queries for already-cached data will not require any communication with the authoritative servers at all.
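Keeping tabs on server health can be as simple as a per-server response-time estimate (a naive sketch with invented names; real resolvers use smoothed RTT estimates and decaying penalties):

```python
class ServerStats:
    """Naive per-server bookkeeping for picking the best nameserver."""

    def __init__(self, servers, penalty=5.0):
        # Assume every server is fast until proven otherwise.
        self.rtt = {s: 0.0 for s in servers}
        self.penalty = penalty  # seconds added on timeout/failure

    def pick(self):
        # Prefer the server with the lowest estimated response time.
        return min(self.rtt, key=self.rtt.get)

    def record_success(self, server, rtt):
        self.rtt[server] = rtt

    def record_failure(self, server):
        # Penalize the failed server so others are preferred for a while.
        self.rtt[server] += self.penalty
```

After the first cold query pays the timeout against a dead server, subsequent picks go straight to the responsive one, which is why only the first uncached lookup tends to be slow.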

All in all, no, you should not expect a 50% chance of a user-visible error when there are two nameservers and one is down. More likely, the first lookup in a completely cold-cache scenario will just be slightly slow.