I have dnsmasq installed on server and it is heavily used.I have following dnsmasq log in the /var/log/syslog when given the signal "SIGUSR1"

Jul 16 13:45:50 server1 dnsmasq[427008]: time 1531748750
Jul 16 13:45:50 server1 dnsmasq[427008]: cache size 10000, 18355/597070 cache insertions re-used unexpired cache entries.
Jul 16 13:45:50 server1 dnsmasq[427008]: queries forwarded 1510313, queries answered locally 15347110
Jul 16 13:45:50 server1 dnsmasq[427008]: queries for authoritative zones 0
Jul 16 13:45:50 server1 dnsmasq[427008]: server 100.1.2.3#53: queries sent 242802, retried or failed 0
Jul 16 13:45:50 server1 dnsmasq[427008]: server 100.2.3.4#53: queries sent 1371704, retried or failed 0

I have set cache size to 10000 which is max allowed value. The application which uses dnsmasq for request also feels slow. How to understand the dnsmasq log and know if these logs represent performance issue due to low cache size ?


Solution 1:

I should stress before this answer that I'm not an expert in dnsmasq in particular, but the logs combined with the man page do seem fairly clear:

When it receives a SIGUSR1, dnsmasq writes statistics to the system log. It writes the cache size, the number of names which have had to removed from the cache before they expired in order to make room for new names and the total number of names that have been inserted into the cache. The number of cache hits and misses and the number of authoritative queries answered are also given. For each upstream server it gives the number of queries sent, and the number which resulted in an error. In --no-daemon mode or when full logging is enabled (-q), a complete dump of the contents of the cache is made.

So given your line:

cache size 10000, 18355/597070 cache insertions re-used unexpired cache entries.

This line is a little confusing, but given the text above I take it to mean 18355 entries had to be removed before expiry to make room for new entries, and 597070 is the total number of insertions. Given that both these numbers are far larger than your cache, a larger cache would probably be of some benefit, however the developers would have limited it to 10000 for good reason. It probably wouldn't scale well even if you could increase it, so if you really do need a larger cache, you may be using the wrong tool.

Jul 16 13:45:50 server1 dnsmasq[427008]: queries forwarded 1510313, queries answered locally 15347110

The majority of your queries are being answered locally (by a factor of 10), which is good, however these numbers are far greater than the cache utilisation. To me this suggests that a lot of your requests are not cachable for some reason.

This could be because a lot of the recordshave a TTL of zero for example (DHCP records for example frequently have zero ttls).

On the other hand, the one in 10 requests that has to be fetched from a remote server could explain the perceived sluggishness. It's impossible to tell from the information given though.

Also, even if all your queries were cachable, you appear to have full cache ultilization anyway given the number of entries that were evicted before expiry (although this depends on the workload, a "bursty" pattern could cause a high number of evictions).

Jul 16 13:45:50 server1 dnsmasq[427008]: server 100.1.2.3#53: queries sent 242802, retried or failed 0

Jul 16 13:45:50 server1 dnsmasq[427008]: server 100.2.3.4#53: queries sent 1371704, retried or failed 0

These lines indicate that 100.2.3.4 is getting many more requests from your dnsmasq server than 100.1.2.3. It's probably not the cause of the problem, but interesting none the less.

As a side note, if those are the addresses you are actually using, and they are LAN addresses, you might want to consider changing them. 100.0.0.0/8 is a publicly routable block, so you could have a confused router somewhere trying to route your DNS queries externally. Even if you put static routes in all your routers it's still bad practice, as a user on your network wouldn't be able to access anything on the internet in that range.