Why is geo-redundant DNS necessary for small sites?
This is a Canonical Question about DNS geo-redundancy.
It's common knowledge that geo-redundant DNS servers, located at separate physical locations, are highly desirable when providing resilient web services. This is covered in depth by BCP 16, but some of the most frequently mentioned reasons include:
- Protection against datacenter disasters. Earthquakes happen. Fires happen in racks and take out nearby servers and network equipment. Multiple DNS servers won't do you much good if physical problems at the datacenter knock out both DNS servers at once, even if they're not in the same row.
- Protection against upstream peer problems. Multiple DNS servers won't prevent problems if a shared upstream network peer takes a dirt nap. Whether the upstream problem completely takes you offline, or simply isolates all of your DNS servers from a fraction of your userbase, the end result is that people can't access your domain even if the services themselves are located in a completely different datacenter.
That's all well and good, but are redundant DNS servers really necessary if I'm running all of my services off of the same IP address? I can't see how having a second DNS server would provide me any benefit if no one can get to anything provided by my domain anyway.
I understand that this is considered a best practice, but this really seems pointless!
Solution 1:
Note: Content in dispute, refer to comments for both answers. Errors have been found and this Q&A is in need of an overhaul.
I'm removing the accept from this answer until the state of this canonical Q&A is properly addressed. (Deleting this answer would also delete the attached comments, which isn't the way to go IMO. I'll probably turn it into a community wiki answer after extensive editing.)
I could quote RFCs here and use technical terms, but this is a concept that gets missed by a lot of people on both ends of the knowledge spectrum and I'm going to try to answer this for the broader audience.
> I understand that this is considered a best practice, but this really seems pointless!
It may seem pointless...but it's actually not!
Recursive servers are very good at remembering when remote servers do not respond to a query, particularly when they retry and still never see a reply. Many implement negative caching of these communication failures, and will temporarily put unresponsive nameservers in the penalty box for a period of time no greater than five minutes. Eventually this "penalty" period expires and the recursive server resumes communication with that nameserver. If the query fails again, the nameserver goes right back into the box; otherwise it's back to business as usual.
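To make the penalty box idea concrete, here is a minimal sketch in Python. It is not how any particular resolver implements it. The `PenaltyBox` class, its method names, and the fixed 300-second penalty are assumptions for illustration only (real implementations vary and may use shorter or adaptive timers).

```python
import time

# Assumed fixed penalty, taken from the "five minutes" upper bound above.
PENALTY_SECONDS = 300


class PenaltyBox:
    """Hypothetical tracker for nameservers that recently failed to answer."""

    def __init__(self, penalty_seconds=PENALTY_SECONDS, clock=time.monotonic):
        self.penalty_seconds = penalty_seconds
        self.clock = clock
        self._released_at = {}  # server IP -> time the penalty ends

    def record_failure(self, server):
        # Called once retries are exhausted without a reply.
        self._released_at[server] = self.clock() + self.penalty_seconds

    def is_penalized(self, server):
        release = self._released_at.get(server)
        if release is None:
            return False
        if self.clock() >= release:
            # Penalty expired: the server may be tried again.
            del self._released_at[server]
            return False
        return True

    def usable(self, servers):
        # Servers the resolver is still willing to query right now.
        return [s for s in servers if not self.is_penalized(s)]
```

With two nameservers on the list, `usable()` still returns one of them while the other sits out its penalty. With only one, it returns an empty list until the timer runs out.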
This is where we run into the single nameserver problem:
- By nature of the implementation, the time a nameserver spends penalized is always going to be greater than or equal to the duration of the network problem. In almost all cases it will be greater, by up to an additional five minutes.
- If your single DNS server goes into the penalty box, the query associated with the failure is going to be completely dead for the full duration.
- Brief routing interruptions happen on the internet more than most people realize. TCP/IP retransmissions and similar application safeguards do a good job of hiding this from the user, but it's somewhat unavoidable. The internet does a good job of routing around this damage for the most part due to safeguards built into the various standards that support the network stack...but that also includes the ones built into DNS, and having geo-redundant DNS servers is a part of that.
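The arithmetic behind the first two points can be sketched as follows. This is a simplified model, not a measurement. It assumes a fixed 300-second penalty, a resolver whose first failed query lands exactly when the network problem starts, and retries that happen only when a penalty expires. The function name `resolver_visible_outage` is made up for this example.

```python
def resolver_visible_outage(outage_seconds, penalty_seconds=300):
    """Seconds a single-nameserver zone stays dead for one resolver.

    Assumes the first failed query happens at t=0, when the network
    problem begins, and that the resolver only retries the nameserver
    each time a penalty period expires.
    """
    t = 0
    while t < outage_seconds:
        # The attempt at time t fails, so the server is penalized again.
        t += penalty_seconds
    return t


# A 10-second routing blip versus what the resolver actually experiences.
blip = 10
print(blip, "second blip ->", resolver_visible_outage(blip), "seconds dead")
```

Under these assumptions, the resolver-visible outage is never shorter than the real one, and the overshoot stays below one penalty period. A second nameserver that is not affected by the same blip gives the resolver somewhere else to go in the meantime.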
Long story short, if you go with a single DNS server (this includes using the same IP address multiple times across NS records), this is going to happen. It's also going to happen a lot more than you realize, but the problem will be so sporadic that the odds of the failure 1) being reported to you, 2) being reproduced, and 3) being tied to this specific problem are extremely close to zero. They pretty much were zero if you came into this Q&A not knowing how this process worked, but thankfully that shouldn't be the case now!
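If you want to check whether your own NS set quietly collapses to a single server, a rough sketch like the one below can help. It works on addresses you have already looked up. The zone data and function names are hypothetical, and the IPs come from the reserved documentation ranges.

```python
import ipaddress


def distinct_nameserver_ips(ns_addresses):
    """Collect the distinct IP addresses behind a zone's NS records.

    ns_addresses maps each NS hostname to the addresses it resolves to.
    """
    distinct = set()
    for addresses in ns_addresses.values():
        for address in addresses:
            # Normalize so equivalent spellings of one address compare equal.
            distinct.add(ipaddress.ip_address(address))
    return distinct


def is_effectively_single_server(ns_addresses):
    # Several NS names pointing at one IP still behave like one server.
    return len(distinct_nameserver_ips(ns_addresses)) <= 1


# Hypothetical zone: two NS names sharing one address.
same_ip_zone = {
    "ns1.example.com.": ["192.0.2.53"],
    "ns2.example.com.": ["192.0.2.53"],
}
# Hypothetical zone: two NS names with two distinct addresses.
two_ip_zone = {
    "ns1.example.com.": ["192.0.2.53"],
    "ns2.example.net.": ["198.51.100.53"],
}

print(is_effectively_single_server(same_ip_zone))
print(is_effectively_single_server(two_ip_zone))
```

Keep in mind that distinct IP addresses are only a first check. Two addresses in the same rack or behind the same upstream are distinct on paper but not geo-redundant.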
Should this bother you? It's not really my place to say. Some people won't care about this five minute interruption problem at all, and I'm not here to convince you that you should. What I am here to convince you of is that you do in fact sacrifice something by running only a single DNS server, in all scenarios.