BIND - Slow Query Responses Isolated To Single Interface
Solution 1:
This problem was caused by the iowait being maxed out on the server. It was running consistently at 100% with kjournald as the service causing it.
Thanks to a suggestion from Andrew B, I begun investigating UDP packet errors using netstat -su | grep errors
. From this, I could see that it was shooting up by 30 - 50 packets every second roughly. This led me to check the buffer for UDP per socket by running netstat -uanp
. From this, I was able to confirm that the random delays and occasional timeouts (drops) were occurring due to the buffer being full. I found out the buffer was full by analysing the value in the Recv-Q column for the BIND service listening on the IP / port in question.
After identifying the buffer was full, there wasn't much point in increasing it as it would no doubt become saturated once more. Instead, as the CPU load and RAM looked okay, I started wondering if disk operations could be causing a bottle-neck for UDP packets being dealt with. This was confirmed by running the commmand top
and analysing the iowait value.
Once I identified that the CPU was waiting nearly 100% of the time for io operations to complete, I started using tools such as iotop
to find what was writing to disk. Turned out the journal system for the ext3 file-system was generating all of the waiting. This led me to think that perhaps it was extreme amounts of logging on the server that could be causing the saturation, as I knew the /var/log/messages
file was receiving a ton of denied query logs every second.
Testing the above theory, I added the following line to named.conf within the logging area. This line disables logging for approve / deny messages related to received queries. There is a log per query put into /var/log/messages
, which can be a lot if you get barraged by clients:
category security { null; };
Fortunately, after restarting BIND, I could see a dramatic drop in iowait percentage. Testing queries, I was able to confirm that they are being answered well within a tenth of a second now; a dramatic improvement on before.
In hindsight I should have checked the iowait time initially. Hope this helps anyone experiencing similar issues. I'll now be looking into controlling the logging a little more and seeing what I can do about these denied messages.