Real-world impact of partial authoritative DNS outage

Following the Dyn outage on Friday we are considering adding a secondary authoritative DNS provider. We would like to understand the real-world impact should one of the providers have an outage.

For example, if our NS records were to look like

ns1.provider-a.com
ns1.provider-b.com
ns2.provider-a.com
ns2.provider-b.com

and either provider-a or provider-b experienced an outage, what would users experience in the worst case (no cache)? I would expect something like increased latency in getting a valid response (if the resolver first attempts to reach a downed server), or perhaps a resolution failure 50% of the time. If the behavior is implementation-dependent, any sense of the spread of behaviors across common resolvers would be very helpful.


In short, it should function the way you need it to.

Authoritative DNS is designed to be fast and fault tolerant. Recursive resolvers are written to obtain a valid authoritative response from your pool of servers as quickly as possible, which includes the assumption that one or more of them may be slow, unresponsive, or misconfigured (SERVFAIL responses). One or more unusable servers adds a slight (usually negligible) overhead to obtaining an answer, but once that answer has been obtained it can be cached for the duration specified by the record's TTL. Only the users whose requests arrive while the record is not cached see that small delay; everyone else is answered from cache immediately.
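To make the behavior concrete, here is a minimal sketch (not a real resolver) of the pattern described above: try one authoritative server, time out quickly, move on to the next, and cache whatever answer finally comes back for its TTL. It assumes the dnspython package, and the server IPs are placeholders.

import time
import dns.message
import dns.query
import dns.rcode

AUTH_SERVERS = ["192.0.2.1", "192.0.2.2", "198.51.100.1", "198.51.100.2"]
cache = {}  # (qname, qtype) -> (expiry_time, answer rrsets)

def lookup(qname, qtype="A"):
    key = (qname, qtype)
    now = time.time()
    if key in cache and cache[key][0] > now:
        return cache[key][1]                      # answered from cache, no extra delay
    query = dns.message.make_query(qname, qtype)
    for server in AUTH_SERVERS:                   # real resolvers reorder by measured RTT
        try:
            resp = dns.query.udp(query, server, timeout=2)
        except Exception:
            continue                              # unresponsive: eat the timeout, try the next one
        if resp.rcode() == dns.rcode.SERVFAIL:
            continue                              # misconfigured: treat like a failure
        ttl = min((rrset.ttl for rrset in resp.answer), default=0)
        cache[key] = (now + ttl, resp.answer)     # cache for the record's TTL
        return resp.answer
    raise RuntimeError("all authoritative servers failed")

The worst case is a few timeouts' worth of latency for the unlucky first query; subsequent queries for the same name are served from cache.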

Negative caching of communication failures is optional and frequently implemented (see RFC 2308 §7), but it will not buy you much in the way of backoff: failures may be cached for at most five minutes, and are only remembered per <query name, type, class, server IP address> tuple. As stated earlier, this should not present a problem; I mention it mostly to avoid confusion.
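A sketch of that failure cache, under the constraints just described: each entry is keyed on the full <query name, type, class, server IP> tuple and is held for at most five minutes, so the backoff it provides is both short and narrow.

import time

MAX_FAILURE_TTL = 300  # five-minute cap on cached failures, per RFC 2308 §7

failure_cache = {}  # (qname, qtype, qclass, server_ip) -> expiry time

def record_failure(qname, qtype, qclass, server_ip):
    """Remember that this exact query failed against this exact server."""
    failure_cache[(qname, qtype, qclass, server_ip)] = time.time() + MAX_FAILURE_TTL

def should_skip(qname, qtype, qclass, server_ip):
    """True if this query recently failed against this server and should be skipped for now."""
    expiry = failure_cache.get((qname, qtype, qclass, server_ip))
    return expiry is not None and expiry > time.time()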

The biggest problem you are going to have is synchronization. You must monitor all of these authoritative servers for serial numbers falling out of sync. Recursive resolvers will trust the first of your servers to return an authoritative response, so if one server returns NXDOMAIN while the others do not, the non-existence of that record can be cached for much longer than five minutes, depending on how your SOA record (specifically its negative-caching TTL) is configured.
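Here is a sketch of the kind of serial-number monitoring I mean: query the SOA record directly from every authoritative server and alert if any serial disagrees. It assumes dnspython, and the zone and name server names are the placeholders from your example.

import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

ZONE = "example.com."
NAMESERVERS = ["ns1.provider-a.com", "ns1.provider-b.com",
               "ns2.provider-a.com", "ns2.provider-b.com"]

def soa_serials(zone, nameservers):
    serials = {}
    for ns in nameservers:
        addr = dns.resolver.resolve(ns, "A")[0].address       # resolve the NS host itself
        query = dns.message.make_query(zone, dns.rdatatype.SOA)
        resp = dns.query.udp(query, addr, timeout=5)           # ask that server directly
        for rrset in resp.answer:
            if rrset.rdtype == dns.rdatatype.SOA:
                serials[ns] = rrset[0].serial
    return serials

serials = soa_serials(ZONE, NAMESERVERS)
if len(set(serials.values())) > 1:
    print("WARNING: zone serials out of sync:", serials)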


To summarize, it's very important to understand the difference between the negative caching of unresponsive/misconfigured servers and the answers cached from servers that respond normally. Servers which are functional and responding, but serving a stale copy of the zone, can and will do far more damage in this configuration than their non-functional counterparts. If you can avoid that trap, the new configuration should hold up well in your proposed failure scenario.

(caveat: I am assuming that Provider A and Provider B are both geo-redundant providers who know what they are doing. Anyone intending to take one of these roles in-house should read BCP 16 in full and ensure that they have a DNS expert in their employ. A server admin who has read a book about it once is playing with fire.)