Caching DNS returns SERVFAIL for NS record, but dig +trace disagrees?

This question is similar, but doesn't elaborate on the confusing case of why an NS record cannot be obtained.

One of our caching DNS environments (RHEL 5.8, BIND 9.3.6-20.P1.el5_8.4) has ceased to return any useful data at all for a zone. Usually this sort of problem turns out to be a stale NS or glue record, but in this particular case I can't even get the cache to report an NS record for the zone.

dig @mycache somedomain NS returns SERVFAIL. There are no nameserver records cached at all.
dig +trace shows a healthy delegation path, with the final nameserver returning a response. Manually running the dig query against the final nameserver returns a valid NS record, the corresponding A record exists and agrees with the glue, etc.

What gives? Why is there no NS record for me to obtain from the DNS cache, not even a bad one?

If there's no authoritative answer for an NS record, then there's nothing to cache other than the failure to determine the authority. That failure is what has been cached, and a server's in-memory information about lame nameservers cannot be queried by a DNS client. The SERVFAIL is as close as you're going to get.

Usually you can identify a problem with stale nameserver records by comparing the NS record in cache to what you find on the internet, but in this case there is no authoritative NS record to cache. Glue records are not authoritative in and of themselves; with no authoritative answer, there is simply no authoritative nameserver.
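
The symptom itself (SERVFAIL from the cache while the authoritative server answers normally) can be checked mechanically by comparing the status field in dig's header line. The following is a minimal sketch, not part of the original answer: the function names are hypothetical, and it assumes you have already captured the text output of the two dig runs (for example `dig @mycache somedomain NS` and the same query against the final nameserver).

```python
import re

# Matches the status field in dig's header line, e.g.
# ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1234"
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def dig_status(dig_output):
    """Return the response code named in dig's header, or None if absent."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def cached_failure_suspected(cache_output, authoritative_output):
    """True when the cache reports SERVFAIL but the authority answers NOERROR."""
    return (dig_status(cache_output) == "SERVFAIL"
            and dig_status(authoritative_output) == "NOERROR")
```

A True result only says the two servers disagree; it does not tell you which of the causes below is responsible.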

One of two things is usually happening here:

dig +trace is getting a stale answer for an intermediate nameserver from your local cache, and there really is a problem going on at the moment. I've covered this behavior in another question.
The caching server encountered NXDOMAIN or SERVFAIL when chasing glue records to find an authoritative nameserver, and this event has been cached. Even if the problem has been corrected, or the glue has been pointed somewhere else, the nameserver isn't going to try asking for it again until an internal timer expires. Requesting a cache purge for the zone in question will usually reset it.

The latter case is usually the culprit. If you want to be absolutely sure, it may be possible to dump your nameserver's runtime cache and view the glue in memory (e.g. with BIND's rndc dumpdb). Be advised that this is a very expensive operation unless you can limit the scope of the dump to a single zone, and it is generally something to avoid under high load.
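
If you do take a dump, picking the relevant entries out of it by hand is tedious. Here is a rough sketch of a filter, under some assumptions that you should verify against your own dump file: that the dump is plain text in master-file style, that comment lines start with `;`, and that a line beginning with whitespace inherits the owner name of the previous record. The function name `records_under` is hypothetical.

```python
def records_under(dump_lines, zone):
    """Yield (owner, record_text) for dump records at or below `zone`.

    Assumes master-file style text: ';' starts a comment, '$' starts a
    directive, and a record line with leading whitespace reuses the
    owner name of the record before it.
    """
    zone = zone.rstrip(".").lower() + "."
    owner = None
    for line in dump_lines:
        line = line.rstrip("\n")
        stripped = line.strip()
        # Skip blank lines, comments, and directives such as $DATE.
        if not stripped or stripped.startswith(";") or line.startswith("$"):
            continue
        # A line that starts in column 0 names a new owner.
        if not line[0].isspace():
            owner = line.split()[0].lower()
        if owner is not None and (owner == zone or owner.endswith("." + zone)):
            yield owner, stripped
```

Feeding it the open dump file, as in `for owner, rec in records_under(open(path), "somedomain"): print(owner, rec)`, should list whatever NS and address records the cache holds for the zone, where `path` is wherever your server writes its dump. An empty result is consistent with the "nothing but the failure was cached" case described above.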
