How do I understand my CPU usage on a DNS server?

I have read and understood "Can you help me with my capacity planning?", but I'm not sure what my next steps are in a DNS server scenario. I think my CPU load is high or that I might be starting to drop queries, but I'd like to better understand the load on my server before I take action. This is particularly concerning to me because it's common knowledge that scaling your infrastructure to DDoS loads is a losing battle.

What should I be analyzing in order to understand my environment?


On Server Fault, we usually tell you that we can't help with your capacity planning. This is for good reason: we don't know the specifics of your environment, and the answers on how to measure it are pretty much the same. Unfortunately, DNS capacity measurement is a poorly understood topic, and most admins will assume that high CPU usage means it's time to consider adding capacity. When the load comes from an attack, that is a really, really bad idea: scaling to a DNS DDoS will inevitably lead to your network devices choking (or worse, to people reaching out to your legal department).

Server logs and packet captures are what most admins will try leveraging first, but the simple truth is that SNMP can tell you far more about the environment than your logs can. Don't ignore logs and packet captures, but SNMP will usually help you spot the existence of a problem faster.

In addition to tracking the default system stats provided by your SNMP monitoring tool (which should include CPU load, per-interface throughput and packet counters, disk I/O, etc.), I recommend adding the following OIDs:

  • UDP-MIB
    • Receive queue errors: udpInErrors (angry red color strongly recommended)
    • UDP datagram counters: udpInDatagrams, udpOutDatagrams
    • (optional) Unexpected datagrams: udpNoPorts
  • TCP-MIB
    • TCP segment counters: tcpInSegs, tcpOutSegs
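
If you want to sanity-check what your SNMP agent reports, the same counters can usually be read locally. Here is a minimal sketch, assuming a Linux host, where /proc/net/snmp exposes them as a header line followed by a value line for each protocol. The sample text is made up for illustration, and the helper names (parse_proc_net_snmp, dns_counters) are mine, not part of any tool.

```python
# Made-up sample in the layout Linux uses for /proc/net/snmp:
# one line of field names, then one line of values, per protocol.
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 10 20 0 0 3 5000 4800 12 0 4 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 900000 15 42 899000 42 0 0 0
"""

# MIB object name -> (protocol section, field name) in /proc/net/snmp
MIB_TO_PROC = {
    "udpInErrors": ("Udp", "InErrors"),
    "udpInDatagrams": ("Udp", "InDatagrams"),
    "udpOutDatagrams": ("Udp", "OutDatagrams"),
    "udpNoPorts": ("Udp", "NoPorts"),
    "tcpInSegs": ("Tcp", "InSegs"),
    "tcpOutSegs": ("Tcp", "OutSegs"),
}


def parse_proc_net_snmp(text):
    """Turn the header/value line pairs into {protocol: {field: int}}."""
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: "Proto: Name1 Name2 ..." then "Proto: v1 v2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            raise ValueError(f"mismatched line pair: {proto} / {value_proto}")
        stats[proto] = {
            name: int(num) for name, num in zip(names.split(), numbers.split())
        }
    return stats


def dns_counters(stats):
    """Pick out the counters recommended above, keyed by MIB object name."""
    return {
        mib: stats[proto][field] for mib, (proto, field) in MIB_TO_PROC.items()
    }


if __name__ == "__main__":
    for name, value in dns_counters(parse_proc_net_snmp(SAMPLE)).items():
        print(name, value)
```

On a real host you would pass the contents of /proc/net/snmp instead of SAMPLE. These are cumulative counters, so a single reading means little; what you graph is the change between readings.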

Interpreting the graphs

These graphs can be lumped into two categories: metrics that indicate a problem, and metrics that help you diagnose it.

Indicators

  • High CPU utilization is bad. This is a given, but when it happens you need to look for other metrics to correlate it to. If high CPU utilization maps to a spike in outbound network utilization (either throughput or number of packets), chances are pretty good that you're being used in a DDoS attack. Specifics on how to interpret the nature of the attack are in the following section.
  • udpInErrors is your primary sign of a capacity problem. The MIB defines it as datagrams that could not be delivered for reasons other than a missing listener; in practice, on a DNS server it mostly increments when the kernel drops a UDP datagram because the application isn't processing traffic fast enough. A climbing counter means that your DNS service is overloaded and not able to keep up with the incoming traffic.
    • Most network performance guides will tell you that increasing the size of your receive queue is not the correct solution: they're usually right. Try to find a reason explaining why the server is overloaded, either by looking at the other graphs or analyzing logs.
    • If your company operates mail servers that use DNSBL tables, keep in mind that snowshoe attacks can create brief spikes in legitimate DNS traffic that can exhaust space in your receive queues. This is one of the cases where it might be worthwhile to increase the size of your socket receive queues (since it's a known temporary condition), but generally it's better to throw more hardware at the problem to keep your latency down.

If you cannot correlate increases in these metrics to other performance problems on the system, congratulations: you are legitimately nearing/over capacity and it's time to add servers. Consider me impressed. :)
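
Since udpInErrors is a cumulative Counter32, what matters on a graph is its rate of change between polls. Most monitoring tools do this for you; the sketch below only illustrates the arithmetic, including the 32-bit wraparound. The function name and the sample numbers are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # udpInErrors is defined as a Counter32


def counter_rate(prev, curr, interval_s, modulus=COUNTER32_MODULUS):
    """Per-second rate between two readings of a wrapping SNMP counter.

    The modulo handles a single wrap between polls. A counter reset
    (agent restart, reboot) is indistinguishable from a wrap here and
    would show up as one bogus spike.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = (curr - prev) % modulus
    return delta / interval_s


if __name__ == "__main__":
    # Two polls 60 seconds apart: 300 dropped datagrams in between.
    print(counter_rate(100, 400, 60))
    # Counter wrapped past 2**32 between polls: 30 drops in 10 seconds.
    print(counter_rate(2 ** 32 - 10, 20, 10))
```

Any sustained non-zero rate here deserves the "angry red" treatment, because every increment is a datagram that never reached the application.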

Diagnosing

This covers DNS-specific items only. Use your head here, and don't expect this to be all-inclusive. (For example, disk I/O saturation is not a problem specific to DNS.)

  • On a busy recursive server, outbound throughput should remain in the neighborhood of 2x your input. This is because replies are usually much larger than the associated queries. Sustained spikes that are significantly above this level indicate that your server is participating in an amplification attack. You are most likely operating an open resolver.
  • Packets in should be roughly equal to packets out, even on a recursive DNS server. While there will be an occasional need for a query to be retransmitted due to a timeout, this does not happen so often that it will cause a significant graph skew. A significant increase in outgoing packets indicates a network problem, or that your cluster is being used in an attack against authoritative nameservers. This does not necessarily suggest that you are operating an open resolver: other DNS servers might be forwarding queries to you that can't be cached.
  • It may seem redundant to graph UDP and TCP I/O in addition to the per-interface graphs, but these counters are not tied to any one interface. Once you have enough experience with your nameserver software, they also give you insight into the nature of an attack in progress.
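
The two ratio checks above can be sketched as a small function over the per-interval deltas of your interface counters. The thresholds are assumptions chosen for illustration (the text only says "in the neighborhood of 2x" and "roughly equal"), so tune them against your own baseline before trusting any alert built on them.

```python
def diagnose_traffic(bytes_in, bytes_out, pkts_in, pkts_out,
                     byte_ratio_limit=3.0, pkt_ratio_limit=1.5):
    """Flag traffic ratios that are out of line for a recursive DNS server.

    Arguments are deltas over one polling interval. Both limits are
    illustrative guesses, not values from any standard or tool.
    """
    findings = []
    # Outbound throughput well above ~2x inbound suggests amplification.
    if bytes_in > 0 and bytes_out / bytes_in > byte_ratio_limit:
        findings.append(
            "possible amplification attack: check for an open resolver")
    # Packets out should roughly match packets in.
    if pkts_in > 0 and pkts_out / pkts_in > pkt_ratio_limit:
        findings.append(
            "outgoing packet surge: network problem or attack on "
            "authoritative nameservers")
    return findings


if __name__ == "__main__":
    # Healthy: about 2x bytes out, packet counts roughly equal.
    print(diagnose_traffic(1_000_000, 2_100_000, 10_000, 10_200))
    # Replies far larger than queries, packet counts still balanced.
    print(diagnose_traffic(1_000_000, 12_000_000, 10_000, 10_100))
```

A single interval over the limit is not worth paging anyone for; the text above is about sustained spikes, so in practice you would require several consecutive intervals before raising the alarm.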

Side note: udpNoPorts isn't really a capacity metric, but it's useful for identifying cache poisoning attempts. This counter increments every time a UDP datagram arrives on a port that nothing is listening on, and a sustained wall of these during normal operation can suggest that someone is trying to forge a reply. (Either that, or one of your listeners isn't running: turn that back on foo'!)