BIND SERVFAIL after upgrade to Debian Jessie
Solution 1:
This one is a real pain to troubleshoot if you aren't familiar with the new max-recursion-queries option or why it was added.
CVE-2014-8500 was identified in late 2014 as impacting multiple nameserver products, including BIND. The flaw allows a malicious nameserver to craft a chain of referrals that will be followed indefinitely, eventually leading to resource exhaustion. ISC's fix was to add an upper limit on how many iterative queries the server is willing to send on behalf of a single recursive query. The ceiling is controlled by a new max-recursion-queries option that defaults to 75.
As it turns out, a ceiling of 75 is not very friendly to an empty nameserver cache, which you will always have after a full process restart. There are many domains that will fail to resolve with this default because of how many referrals end up being traversed between a requested record and . (root). The pandion.im. domain happens to be one of those, and it probably has something to do with the glueless delegation from the TLD. Here's an excerpt from dig +trace +additional pandion.im:
im. 172800 IN NS ns4.ja.net.
im. 172800 IN NS hoppy.iom.com.
im. 172800 IN NS barney.advsys.co.uk.
im. 172800 IN NS pebbles.iom.com.
ns4.ja.net. 172800 IN A 193.62.157.66
hoppy.iom.com. 172800 IN A 217.23.163.140
barney.advsys.co.uk. 172800 IN A 217.23.160.50
pebbles.iom.com. 172800 IN A 80.168.83.242
ns4.ja.net. 172800 IN AAAA 2001:630:0:47::42
;; Received 226 bytes from 199.7.83.42#53(199.7.83.42) in 29 ms
pandion.im. 259200 IN NS ed.ns.cloudflare.com.
pandion.im. 259200 IN NS jill.ns.cloudflare.com.
;; Received 81 bytes from 80.168.83.242#53(80.168.83.242) in 98 ms
The nameservers for im. are delegating pandion.im. to Cloudflare's nameservers without providing IP address glue. On an empty cache, this means that the server has to initiate a separate referral traversal to obtain the IP address of those nameservers, and all of those referrals count against the maximum number of recursions for the original query. At that point the query will only succeed if the server already knows the IP addresses of those nameservers from other queries:
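To check whether some other delegation is glueless in the same way, something like the following sketch may help. The function name glueless_ns is hypothetical (it is not part of dig or BIND). It only parses dig-style output on stdin and prints every NS target that has no A or AAAA record in the same response.

```shell
# Hypothetical helper: read dig output on stdin and list NS targets
# that come without an address record (glue) in the same response.
glueless_ns() {
  awk '
    $4 == "NS"                { ns[tolower($5)] = 1 }
    $4 == "A" || $4 == "AAAA" { addr[tolower($1)] = 1 }
    END { for (n in ns) if (!(n in addr)) print n }
  ' | sort
}

# Example usage (needs network access; the server name is taken from the
# trace above and may have changed since):
#   dig +norecurse @pebbles.iom.com pandion.im NS | glueless_ns
```

Asking the parent zone's server directly with +norecurse matters here: a recursive resolver with a warm cache could fill in the addresses itself and hide the missing glue.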
# service named restart && sleep 1 && dig @localhost pandion.im | grep status
Checking named config:
Stopping named: [ OK ]
Starting named: [ OK ]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 63173
Trying again, this time looking up those nameservers before pandion.im. itself:
# service named restart && sleep 1 && dig @localhost ed.ns.cloudflare.com jill.ns.cloudflare.com pandion.im | grep status
Checking named config:
Stopping named: [ OK ]
Starting named: [ OK ]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 26428
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30491
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22162
Long story short, this problem is very non-intuitive to identify, especially since it will seem to "go away" over time if the process is left running and the cache fills up. One of our partners has recommended setting max-recursion-queries to 200 based on real world usage scenarios. Start with 200, and season to taste if it's too high for your liking.
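For reference, a minimal sketch of that change. The line goes inside the existing options block of your BIND configuration. The file location is an assumption: on Debian it is usually /etc/bind/named.conf.options, but your layout may differ.

```
options {
    // ... your existing options stay as they are ...

    // Raise the cap on iterative queries per recursive query (default 75).
    max-recursion-queries 200;
};
```

Running named-checkconf afterwards validates the syntax, and rndc reconfig (or a restart) applies the new value.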