Random DNS resolution outages, no idea what to try next. (20.04.02)
For the past 3 months, I have been struggling with a random issue on my homeserver where DNS resolution drops for a brief period of time (10-60 seconds) for absolutely no reason. Pinging via hostname results in "ping: signal.org: Temporary failure in name resolution", and any services that attempt a DNS lookup fail near instantly. There are no systemd-resolved or dnsmasq logs in /var/log/syslog when these outages happen, but other services will report issues. For example:
ddclient[573749]: message repeated 14 times: [ WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org']
dockerd[1811]: time="2021-04-29T13:50:19.080258289-05:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
rsyslogd: DNS error: Can't resolve "<local_domain>" [v8.2001.0]
whoopsie[1816]: [17:38:15] Sent; server replied with: Couldn't resolve host name
Current setup: Ubuntu 20.04.2, Netplan set to static IP, dnsmasq is the DNS server with dns-forward-max=1024, and systemd-resolved is disabled and stopped. Server is a Ryzen 3950X, 64GB RAM, OS is installed on an NVMe drive. The server runs many webapp-type services, but the noisiest for DNS requests is easily matrix-synapse.
Things I have tried:
· I have restarted the systemd-resolved service hundreds of times, disabled the service a dozen times, turned off/on the stub resolver, and deleted and re-created the symlink.
· I set a static IP with netplan, and played with /etc/NetworkManager/NetworkManager.conf.
· I installed pihole and unbound via apt for just the server itself. (pihole is currently uninstalled, and unbound is running but nothing is using it to resolve.)
· I installed dnsmasq and completely disabled systemd-resolved.
· I've disabled IPv6 completely on the server.
· I've set "* soft nofile 1048576" and "* hard nofile 1048576" in /etc/security/limits.conf, and /proc/sys/fs/file-max shows 9223372036854775807.
I suspect Docker is the issue, but I have no idea how to verify this. I've currently got 38 Docker containers running, and when I run sudo lsof -i :53 while the issue is happening, I will see:
thomcat@servername:~$ sudo lsof -i :53
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dockerd 1623 root 217u IPv4 1577888 0t0 UDP localhost:46003->localhost:domain
dockerd 1623 root 226u IPv4 1605902 0t0 UDP localhost:50192->localhost:domain
dockerd 1623 root 227u IPv4 1610070 0t0 UDP localhost:52637->localhost:domain
dockerd 1623 root 228u IPv4 1605907 0t0 UDP localhost:55021->localhost:domain
dockerd 1623 root 229u IPv4 1618981 0t0 UDP localhost:57618->localhost:domain
dockerd 1623 root 230u IPv4 1610081 0t0 UDP localhost:35776->localhost:domain
dockerd 1623 root 231u IPv4 1610086 0t0 UDP localhost:60635->localhost:domain
dockerd 1623 root 232u IPv4 1589998 0t0 UDP localhost:43036->localhost:domain
dockerd 1623 root 234u IPv4 1602056 0t0 UDP localhost:58408->localhost:domain
dockerd 1623 root 235u IPv4 1614011 0t0 UDP localhost:43421->localhost:domain
dockerd 1623 root 236u IPv4 1589999 0t0 UDP localhost:60957->localhost:domain
dockerd 1623 root 237u IPv4 1597695 0t0 UDP localhost:53026->localhost:domain
dockerd 1623 root 242u IPv4 1590000 0t0 UDP localhost:41842->localhost:domain
dockerd 1623 root 244u IPv4 1597696 0t0 UDP localhost:49179->localhost:domain
dockerd 1623 root 246u IPv4 1572736 0t0 UDP localhost:46471->localhost:domain
dockerd 1623 root 266u IPv4 1616008 0t0 UDP localhost:35262->localhost:domain
dockerd 1623 root 267u IPv4 1616009 0t0 UDP localhost:54501->localhost:domain
dockerd 1623 root 268u IPv4 1579887 0t0 UDP localhost:33130->localhost:domain
dockerd 1623 root 269u IPv4 1579888 0t0 UDP localhost:33491->localhost:domain
dockerd 1623 root 270u IPv4 1613280 0t0 UDP localhost:49504->localhost:domain
dockerd 1623 root 273u IPv4 1579890 0t0 UDP localhost:43801->localhost:domain
dockerd 1623 root 278u IPv4 1613283 0t0 UDP localhost:44804->localhost:domain
dockerd 1623 root 279u IPv4 1568692 0t0 UDP localhost:39425->localhost:domain
dockerd 1623 root 293u IPv4 1577890 0t0 UDP localhost:52194->localhost:domain
dockerd 1623 root 296u IPv4 1605903 0t0 UDP localhost:50866->localhost:domain
dockerd 1623 root 319u IPv4 1605904 0t0 UDP localhost:58574->localhost:domain
dockerd 1623 root 341u IPv4 1605910 0t0 UDP localhost:37123->localhost:domain
dockerd 1623 root 342u IPv4 1610067 0t0 UDP localhost:48734->localhost:domain
dockerd 1623 root 343u IPv4 1610069 0t0 UDP localhost:35580->localhost:domain
dockerd 1623 root 344u IPv4 1605905 0t0 UDP localhost:45133->localhost:domain
dockerd 1623 root 345u IPv4 1618982 0t0 UDP localhost:53052->localhost:domain
dockerd 1623 root 346u IPv4 1589996 0t0 UDP localhost:56714->localhost:domain
dockerd 1623 root 347u IPv4 1614009 0t0 UDP localhost:37216->localhost:domain
dockerd 1623 root 348u IPv4 1589997 0t0 UDP localhost:38032->localhost:domain
dockerd 1623 root 349u IPv4 1618984 0t0 UDP localhost:53714->localhost:domain
dockerd 1623 root 350u IPv4 1610084 0t0 UDP localhost:42922->localhost:domain
dockerd 1623 root 351u IPv4 1577893 0t0 UDP localhost:32865->localhost:domain
dockerd 1623 root 352u IPv4 1608975 0t0 UDP localhost:58307->localhost:domain
dockerd 1623 root 353u IPv4 1597699 0t0 UDP localhost:33564->localhost:domain
dockerd 1623 root 354u IPv4 1608977 0t0 UDP localhost:58235->localhost:domain
dockerd 1623 root 355u IPv4 1577896 0t0 UDP localhost:46068->localhost:domain
dockerd 1623 root 356u IPv4 1597702 0t0 UDP localhost:32827->localhost:domain
systemd-r 106795 systemd-resolve 12u IPv4 980615 0t0 UDP localhost:domain
systemd-r 106795 systemd-resolve 13u IPv4 980616 0t0 TCP localhost:domain (LISTEN)
http 165553 _apt 3u IPv4 1611999 0t0 UDP localhost:54478->localhost:domain
More things to note:
· The upstream DNS server is a Raspberry Pi 3 B+ running pihole. Nothing else on my network has these DNS resolution problems, so the problem is not with the pihole.
· ssh sessions to the server do not drop when this issue is happening.
· Pinging external IPs works just fine when the issue is happening.
I've been pulling my hair out trying to figure this out. If anyone has any ideas, I would be glad to hear them.
Hello and welcome to the forum!
It might be worth considering the possibility that your problem is not really with DNS. It could be that your network connection is just unreliable for whatever reason and that some packets get lost. The fact that ssh sessions survive does not tell us much, because ssh sessions use TCP, which retransmits lost packets and so can keep working even if many packets are lost. DNS, on the other hand, normally uses UDP, so if some UDP packets (incoming or outgoing) are lost, that could lead to DNS failures.
Testing using ping as you did is a good idea, and it might be worth doing more of that, for example pinging more often than once per second to see if there is sometimes packet loss and if the timing of that correlates with the DNS problems. Also worth noting is that ping uses a different protocol (ICMP), so there could be some problem with UDP while ping works anyway. It might therefore be worth running tests using UDP, which can be done for example using the iperf3 tool.
If you control the DNS server yourself then you could monitor that to see if the DNS request arrives or not, and to see if a reply is sent from the DNS server.
You could also try monitoring incoming and outgoing network traffic locally using something like tcpdump or tshark to verify that the DNS request gets sent, and to check if the reply from the DNS server can be seen.
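If it helps, here is a rough sketch of a probe that sends a plain DNS query over UDP straight to one nameserver and prints a timestamp whenever no reply comes back, so the gaps can be lined up against the errors in syslog. It only uses the Python standard library. The function names (build_query, probe, monitor) and the default interval and timeout are my own choices for illustration, not part of any existing tool, and the server address you pass in is whatever your upstream resolver actually is.

```python
import socket
import struct
import time


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, hostname: str, timeout: float = 2.0, port: int = 53):
    """Send one UDP query; return the round-trip time in seconds, or None on failure."""
    query_id = int(time.time() * 1000) & 0xFFFF
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(hostname, query_id), (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and "port unreachable" errors
            return None
        # Only accept a reply whose ID matches the query that was sent.
        if len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == query_id:
            return time.monotonic() - start
        return None


def monitor(server: str, hostname: str, interval: float = 0.2, count: int = 50) -> int:
    """Probe repeatedly, print a timestamped line per failure, return the failure count."""
    failures = 0
    for _ in range(count):
        if probe(server, hostname) is None:
            failures += 1
            print(time.strftime("%H:%M:%S"), "no reply from", server)
        time.sleep(interval)
    return failures
```

Calling monitor with the address of the upstream DNS server and a name such as signal.org runs 50 probes about 0.2 seconds apart. Keep in mind that every probe is itself a DNS query, so a very short interval adds noticeable load on the resolver.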
Anyway those are some ideas, I hope some of it can be useful. Good luck!
TL;DR: Make sure your Pi-Hole isn't rate-limiting your requests.
Today, I finally Googled "pihole rate limit", and lo and behold, this recent blog post mentioned:
...we decided to implement a customizable rate-limiting into FTL itself. It defaults to the rather conservative limit of allowing no more than 1000 queries in a 60 seconds window for each client.
I was beside myself and had completely missed this news. I've opened a feature request with Pi-Hole to get a log entry added for when this happens, hopefully to keep a future home sysadmin from pulling their hair out.
1,000 queries in 60 seconds might sound like a lot, but with 38 active Docker containers (and especially Watchtower and matrix-synapse) those get filled up in a hurry.
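To put numbers on it: 1,000 queries per 60 seconds is an average of about 16.7 queries per second, and if all 38 containers reach the Pi-hole through the one server address, each container gets only around 26 queries per minute. Below is a toy model of a per-client limit, just to show how a steady query rate slightly above the threshold turns into short, repeating outages. The FixedWindowLimiter class, the simulate function, and the 20 queries per second figure are all made up for illustration. This is not how FTL implements its limit, only a simple fixed-window counter with the same 1000/60 numbers.

```python
class FixedWindowLimiter:
    """Toy per-client limiter: at most `limit` queries per `window` seconds."""

    def __init__(self, limit: int = 1000, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.window_start = {}  # client -> time the current window began
        self.counts = {}        # client -> queries seen in the current window

    def allow(self, client: str, now: float) -> bool:
        start = self.window_start.get(client)
        # Begin a fresh window on the first query or once the old one has expired.
        if start is None or now - start >= self.window:
            self.window_start[client] = now
            self.counts[client] = 0
        self.counts[client] += 1
        return self.counts[client] <= self.limit


def simulate(rate_per_second: int, duration_seconds: int, limiter: FixedWindowLimiter) -> int:
    """Send a steady stream of queries from one client; return how many were refused."""
    blocked = 0
    for second in range(duration_seconds):
        for _ in range(rate_per_second):
            if not limiter.allow("homeserver", float(second)):
                blocked += 1
    return blocked


# Hypothetical load: 20 queries per second for two minutes.
print(simulate(20, 120, FixedWindowLimiter()))  # prints 400
```

In this model the 1,000 allowed queries are used up after 50 seconds, so the last 10 seconds of every minute are refused, which looks a lot like a brief outage that clears up by itself.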