Windows DNS servers repeatedly requesting records in zone when they get SERVFAIL response

We're seeing a high rate (over 2000 requests/second) of DNS queries from our caching DNS servers to external servers. This may have been happening for a long time; it only came to light recently because of performance problems with our firewall. From talking to colleagues at other institutions, it's clear that we're making more queries than they are.

My initial thought was that the problem was a lack of caching of SERVFAIL responses. Having investigated further, it's clear that the problem is a high level of requests for the failing record from the Windows DNS servers. It seems that in our environment a single query to one of the Windows DNS servers, for a record from a zone which returns SERVFAIL, results in a stream of requests for that record from all of the Windows DNS servers. The stream of requests doesn't stop until I add a fake empty zone on one of the BIND servers.

My plan tomorrow is to verify the configuration of the Windows DNS servers - they should just be forwarding to the caching BIND servers. I figure we must have something wrong there, as I can't believe that no-one else has hit this if it's not a misconfiguration. I'll update this question after that (possibly closing this one and opening a new, clearer one).


Our setup is a pair of caching servers running BIND 9.3.6, which are used either directly by clients or via our Windows domain controllers. The caching servers pass queries to our main DNS servers, which are running BIND 9.8.4-P2 - these servers are authoritative for our domains and pass queries for other domains to external servers.

The behaviour we're seeing is that queries like the one below aren't being cached. I've verified this by looking at network traffic from the DNS servers using tcpdump.

 [root@dns1 named]# dig ptr 119.49.194.173.in-addr.arpa.

 ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.6 <<>> ptr 119.49.194.173.in-addr.arpa.
 ;; global options:  printcmd
 ;; Got answer:
 ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8680
 ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

 ;; QUESTION SECTION:
 ;119.49.194.173.in-addr.arpa.   IN      PTR

 ;; Query time: 950 msec
 ;; SERVER: 127.0.0.1#53(127.0.0.1)
 ;; WHEN: Sun Mar  9 13:34:20 2014
 ;; MSG SIZE  rcvd: 45

Querying Google's server directly shows that we're getting a REFUSED response.

[root@dns1 named]# dig ptr 119.49.194.173.in-addr.arpa. @ns4.google.com.

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.6 <<>> ptr 119.49.194.173.in-addr.arpa. @ns4.google.com.
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 38825
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;119.49.194.173.in-addr.arpa.   IN      PTR

;; Query time: 91 msec
;; SERVER: 216.239.38.10#53(216.239.38.10)
;; WHEN: Sun Mar  9 13:36:38 2014
;; MSG SIZE  rcvd: 45

This isn't just happening with Google addresses or reverse lookups, but a high proportion of the queries are for those ranges (I suspect because of a Sophos reporting feature).

Should our DNS servers be caching these negative responses? I read http://tools.ietf.org/rfcmarkup?doc=2308 but didn't see anything about REFUSED. We don't specify lame-ttl in the config file, so I'd expect that to default to 10 minutes.

I believe this (the lack of caching) is expected behaviour. I don't understand why the other sites I've talked to aren't seeing the same thing. I've tried a test server running the latest stable version of BIND and that shows the same behaviour. I also tried Unbound and that didn't cache SERVFAIL either. There's some discussion of doing this in djbdns, but the conclusion is that the functionality has been removed.

Are there settings in the BIND config that we could change to influence this behaviour? lame-ttl didn't help (and we were running with the default anyway).

As part of the investigation I've added some fake empty zones on our caching DNS servers to cover the ranges leading to most requests. That's dropped the number of requests to external servers but isn't sustainable (and feels wrong as well). In parallel with this I've asked a colleague to get logs from the Windows DNS servers so that we can identify the clients making the original requests.
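
For anyone wanting to find which ranges lead to most requests, here is a rough Python sketch of the tally. It assumes the query names have already been pulled out of tcpdump output or the DNS logs (that extraction step isn't shown, since the format varies), and the function names `enclosing_reverse_zone` and `top_zones` are made up for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def enclosing_reverse_zone(qname: str, octets: int = 2) -> str:
    """Map a reverse-lookup name to the in-addr.arpa zone covering its
    first `octets` address octets (2 gives the /16 zone).

    Names outside in-addr.arpa, or with too few labels, are returned
    unchanged apart from lower-casing and dropping the trailing dot.
    """
    name = qname.lower().rstrip(".")
    labels = name.split(".")
    if labels[-2:] != ["in-addr", "arpa"] or len(labels) < octets + 2:
        return name
    return ".".join(labels[-(octets + 2):])


def top_zones(qnames: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count queries per enclosing zone and return the n busiest zones."""
    return Counter(enclosing_reverse_zone(q) for q in qnames).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample; real input would come from the captured queries.
    sample = [
        "119.49.194.173.in-addr.arpa.",
        "1.46.194.173.in-addr.arpa.",
        "www.example.com.",
    ]
    for zone, count in top_zones(sample):
        print(count, zone)
```

Grouping at the /16 level matches how the 173.194 range is delegated, but `octets` can be changed if the noisy ranges turn out to be delegated differently.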


Solution 1:

The relevant part of RFC 2308 is §7.1, Server Failure (OPTIONAL):

In either case a resolver MAY cache a server failure response. If it does so it MUST NOT cache it for longer than five (5) minutes, and it MUST be cached against the specific query tuple <query name, type, class, server IP address>.

I'm not aware of a simple configuration directive that might override this, though if you were so inclined you could forward that zone elsewhere or serve it directly.
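
To make the §7.1 rule concrete, here is a minimal Python sketch of what such an optional cache could look like: entries are keyed on the full <query name, type, class, server IP address> tuple and never live longer than five minutes. The `ServfailCache` class is purely illustrative and is not how BIND or Unbound are implemented.

```python
import time
from typing import Callable, Dict, Tuple

MAX_SERVFAIL_TTL = 300.0  # seconds; the five-minute ceiling from RFC 2308 §7.1

# The query tuple from §7.1: (query name, type, class, server IP address).
QueryKey = Tuple[str, str, str, str]


class ServfailCache:
    """Hypothetical in-memory cache of server-failure responses."""

    def __init__(self, ttl: float = MAX_SERVFAIL_TTL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Clamp the TTL so a caller cannot exceed the RFC limit.
        self.ttl = min(ttl, MAX_SERVFAIL_TTL)
        self._clock = clock
        self._expiry: Dict[QueryKey, float] = {}

    @staticmethod
    def _key(qname: str, qtype: str, qclass: str, server_ip: str) -> QueryKey:
        # DNS names compare case-insensitively; ignore the trailing root dot.
        return (qname.lower().rstrip("."), qtype.upper(), qclass.upper(), server_ip)

    def remember(self, qname: str, qtype: str, qclass: str, server_ip: str) -> None:
        """Record that this server just failed for this exact query."""
        key = self._key(qname, qtype, qclass, server_ip)
        self._expiry[key] = self._clock() + self.ttl

    def is_failing(self, qname: str, qtype: str, qclass: str, server_ip: str) -> bool:
        """True while a cached failure for this exact tuple is still fresh."""
        key = self._key(qname, qtype, qclass, server_ip)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[key]  # expired; forget it and allow a retry
            return False
        return True
```

Because the server address is part of the key, a failure from one name server says nothing about the others, so a resolver following this rule would still try each of the remaining servers for the zone.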

If it's directly causing firewall problems, you should check the UDP pseudo-connection timeouts: the state the firewall keeps for DNS UDP traffic can fill a state table if the timeout is high. DNS lookups tend to block, so I hope you're not doing (m)any of those on the firewall itself.
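
As a rough back-of-the-envelope check on the state table, the number of entries held at steady state is about the query rate multiplied by how long each entry lives. The Python sketch below assumes every outbound query creates its own state entry (plausible when the resolver uses a fresh source port per query) and that entries persist for the full timeout. The 2000 queries/second figure is from the question; the timeout values are made-up examples, not defaults of any particular firewall.

```python
def estimated_udp_states(queries_per_second: float, udp_timeout_seconds: float) -> float:
    """Steady-state entry count = arrival rate x lifetime of each entry."""
    return queries_per_second * udp_timeout_seconds


if __name__ == "__main__":
    # Illustrative timeouts only; check what your firewall actually uses.
    for timeout in (10, 30, 60):
        entries = estimated_udp_states(2000, timeout)
        print(f"timeout {timeout:>3}s -> about {entries:,.0f} state entries")
```

Comparing those figures with the firewall's state table limit should show whether shortening the UDP timeout is likely to help.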

Some of the reverse zones for 173.194/16 seem broken. They should at worst return cacheable NXDOMAINs rather than SERVFAIL or REFUSED.

$ dig 194.173.in-addr.arpa. ns +short
NS4.GOOGLE.COM.
NS3.GOOGLE.COM.
NS2.GOOGLE.COM.
NS1.GOOGLE.COM.

$ dig @ns4.google.com 119.49.194.173.in-addr.arpa. ns
; <<>> DiG 9.8.4-P4 <<>> @ns4.google.com 119.49.194.173.in-addr.arpa. ns
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 63925
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

194.173.in-addr.arpa is delegated by ARIN to Google:

$ dig @z.arin.net 194.173.in-addr.arpa. ns +auth

;; AUTHORITY SECTION:
194.173.in-addr.arpa.   86400   IN      NS      NS4.GOOGLE.COM.
194.173.in-addr.arpa.   86400   IN      NS      NS1.GOOGLE.COM.
194.173.in-addr.arpa.   86400   IN      NS      NS2.GOOGLE.COM.
194.173.in-addr.arpa.   86400   IN      NS      NS3.GOOGLE.COM.

But those name servers don't play ball; all four return SERVFAIL for:

$ dig @ns4.google.com 194.173.in-addr.arpa. soa

Aside from being "rude", this used to violate ARIN policy, but no longer does. Other zones do work, though (try 46.194.173.in-addr.arpa. or 65.194.173.in-addr.arpa.), so it seems deliberate and selective.