Random DNS resolution outages, no idea what to try next. (20.04.02)

For the past 3 months, I have been struggling with a random issue on my homeserver where DNS resolution drops for a brief period of time (10-60 seconds) for absolutely no reason. Pinging via hostname results in ping: signal.org: Temporary failure in name resolution, and any services that attempt a DNS lookup fail near instantly. There are no systemd-resolved or dnsmasq logs in /var/log/syslog when these outages happen, but other services will report issues. For example:

ddclient[573749]: message repeated 14 times: [ WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org']

dockerd[1811]: time="2021-04-29T13:50:19.080258289-05:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"

rsyslogd: DNS error: Can't resolve "<local_domain>" [v8.2001.0]

whoopsie[1816]: [17:38:15] Sent; server replied with: Couldn't resolve host name

Current setup: Ubuntu 20.04.2, Netplan set to static IP, dnsmasq is the DNS server, with dns-forward-max=1024, systemd-resolved is disabled and stopped. Server is a Ryzen 3950X, 64GB RAM, OS is installed on an NVMe drive. The server runs many webapp-type services, but the noisiest for DNS requests is easily matrix-synapse.

Things I have tried:

· I have restarted the systemd-resolved service hundreds of times, disabled the service a dozen times, turned off/on the stub resolver, and deleted and re-created the symlink.

· I set a static IP with netplan, and played with /etc/NetworkManager/NetworkManager.conf.

· I installed pihole and unbound via apt for just the server itself. (pihole is currently uninstalled, and unbound is running but nothing is using it to resolve.)

· I installed dnsmasq and completely disabled systemd-resolved.

· I've disabled IPv6 completely on the server.

· I've set * soft nofile 1048576 and * hard nofile 1048576 in /etc/security/limits.conf, and /proc/sys/fs/file-max shows 9223372036854775807.

I suspect Docker is the issue, but I have no idea how to verify this. I've currently got 38 Docker containers running, and when I run sudo lsof -i :53 while the issue is happening, I will see:

thomcat@servername:~$ sudo lsof -i :53
COMMAND      PID            USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
dockerd     1623            root  217u  IPv4 1577888      0t0  UDP localhost:46003->localhost:domain 
dockerd     1623            root  226u  IPv4 1605902      0t0  UDP localhost:50192->localhost:domain 
dockerd     1623            root  227u  IPv4 1610070      0t0  UDP localhost:52637->localhost:domain 
dockerd     1623            root  228u  IPv4 1605907      0t0  UDP localhost:55021->localhost:domain 
dockerd     1623            root  229u  IPv4 1618981      0t0  UDP localhost:57618->localhost:domain 
dockerd     1623            root  230u  IPv4 1610081      0t0  UDP localhost:35776->localhost:domain 
dockerd     1623            root  231u  IPv4 1610086      0t0  UDP localhost:60635->localhost:domain 
dockerd     1623            root  232u  IPv4 1589998      0t0  UDP localhost:43036->localhost:domain 
dockerd     1623            root  234u  IPv4 1602056      0t0  UDP localhost:58408->localhost:domain 
dockerd     1623            root  235u  IPv4 1614011      0t0  UDP localhost:43421->localhost:domain 
dockerd     1623            root  236u  IPv4 1589999      0t0  UDP localhost:60957->localhost:domain 
dockerd     1623            root  237u  IPv4 1597695      0t0  UDP localhost:53026->localhost:domain 
dockerd     1623            root  242u  IPv4 1590000      0t0  UDP localhost:41842->localhost:domain 
dockerd     1623            root  244u  IPv4 1597696      0t0  UDP localhost:49179->localhost:domain 
dockerd     1623            root  246u  IPv4 1572736      0t0  UDP localhost:46471->localhost:domain 
dockerd     1623            root  266u  IPv4 1616008      0t0  UDP localhost:35262->localhost:domain 
dockerd     1623            root  267u  IPv4 1616009      0t0  UDP localhost:54501->localhost:domain 
dockerd     1623            root  268u  IPv4 1579887      0t0  UDP localhost:33130->localhost:domain 
dockerd     1623            root  269u  IPv4 1579888      0t0  UDP localhost:33491->localhost:domain 
dockerd     1623            root  270u  IPv4 1613280      0t0  UDP localhost:49504->localhost:domain 
dockerd     1623            root  273u  IPv4 1579890      0t0  UDP localhost:43801->localhost:domain 
dockerd     1623            root  278u  IPv4 1613283      0t0  UDP localhost:44804->localhost:domain 
dockerd     1623            root  279u  IPv4 1568692      0t0  UDP localhost:39425->localhost:domain 
dockerd     1623            root  293u  IPv4 1577890      0t0  UDP localhost:52194->localhost:domain 
dockerd     1623            root  296u  IPv4 1605903      0t0  UDP localhost:50866->localhost:domain 
dockerd     1623            root  319u  IPv4 1605904      0t0  UDP localhost:58574->localhost:domain 
dockerd     1623            root  341u  IPv4 1605910      0t0  UDP localhost:37123->localhost:domain 
dockerd     1623            root  342u  IPv4 1610067      0t0  UDP localhost:48734->localhost:domain 
dockerd     1623            root  343u  IPv4 1610069      0t0  UDP localhost:35580->localhost:domain 
dockerd     1623            root  344u  IPv4 1605905      0t0  UDP localhost:45133->localhost:domain 
dockerd     1623            root  345u  IPv4 1618982      0t0  UDP localhost:53052->localhost:domain 
dockerd     1623            root  346u  IPv4 1589996      0t0  UDP localhost:56714->localhost:domain 
dockerd     1623            root  347u  IPv4 1614009      0t0  UDP localhost:37216->localhost:domain 
dockerd     1623            root  348u  IPv4 1589997      0t0  UDP localhost:38032->localhost:domain 
dockerd     1623            root  349u  IPv4 1618984      0t0  UDP localhost:53714->localhost:domain 
dockerd     1623            root  350u  IPv4 1610084      0t0  UDP localhost:42922->localhost:domain 
dockerd     1623            root  351u  IPv4 1577893      0t0  UDP localhost:32865->localhost:domain 
dockerd     1623            root  352u  IPv4 1608975      0t0  UDP localhost:58307->localhost:domain 
dockerd     1623            root  353u  IPv4 1597699      0t0  UDP localhost:33564->localhost:domain 
dockerd     1623            root  354u  IPv4 1608977      0t0  UDP localhost:58235->localhost:domain 
dockerd     1623            root  355u  IPv4 1577896      0t0  UDP localhost:46068->localhost:domain 
dockerd     1623            root  356u  IPv4 1597702      0t0  UDP localhost:32827->localhost:domain 
systemd-r 106795 systemd-resolve   12u  IPv4  980615      0t0  UDP localhost:domain 
systemd-r 106795 systemd-resolve   13u  IPv4  980616      0t0  TCP localhost:domain (LISTEN)
http      165553            _apt    3u  IPv4 1611999      0t0  UDP localhost:54478->localhost:domain 
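One way to summarise output like this is to count the open port-53 sockets per command, so that snapshots taken during and outside an outage can be compared. The following is only a minimal sketch: the function name count_dns_sockets is made up for illustration, and it assumes the default lsof column layout shown above (nine or more whitespace-separated fields per socket line).

```python
from collections import Counter


def count_dns_sockets(lsof_output: str) -> Counter:
    """Count open port-53 sockets per command in `lsof -i :53` output."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines, a pasted shell prompt line, and the header row.
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue
        counts[fields[0]] += 1
    return counts
```

Fed the listing above, this reports 42 sockets for dockerd, 2 for systemd-r and 1 for http. A count by itself does not prove Docker is at fault; it only shows which process has the most lookups in flight at that moment.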

More things to note:

· The upstream DNS server is a Raspberry Pi 3 B+ running pihole. Nothing else on my network has these DNS resolution problems, so the problem is not with the pihole.

· ssh sessions to the server do not drop when this issue is happening.

· pinging external IPs works just fine when the issue is happening.

 

I've been pulling my hair out trying to figure this out. If anyone has any ideas, I would be glad to hear them.


Hello and welcome to the forum!

It might be worth considering the possibility that your problem is not really with DNS: your network connection may just be unreliable for whatever reason, so that some packets get lost. The fact that ssh sessions survive does not tell us much, because ssh sessions use TCP, which retransmits lost packets, so they can keep working even if many packets are lost. DNS, on the other hand, mostly uses UDP, so if some UDP packets (incoming or outgoing) are lost, that could lead to DNS failures.

Testing using ping as you did is a good idea, and it might be worth doing more of that, for example pinging more often than once per second to see if there is sometimes packet loss and whether its timing correlates with the DNS problems. Also worth noting is that ping uses a different protocol (ICMP), so there could be a problem with UDP while ping still works. In that case it might be worth running tests over UDP, for example with the iperf3 tool.
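To line up the timing of failures with other logs, it could help to probe name resolution several times per second and record when lookups fail. This is just a rough sketch: the names check_resolution and watch, the 0.2 second interval and the attempt count are all made up for illustration, and the lookup goes through the normal system resolver via socket.getaddrinfo.

```python
import socket
import time
from datetime import datetime


def check_resolution(hostname, resolver=socket.getaddrinfo):
    """Try one lookup; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        return True, time.monotonic() - start, ""
    except socket.gaierror as exc:
        return False, time.monotonic() - start, str(exc)


def watch(hostname, interval=0.2, attempts=100, resolver=socket.getaddrinfo):
    """Probe repeatedly and return a list of (timestamp, error) for failures."""
    failures = []
    for _ in range(attempts):
        ok, _elapsed, err = check_resolution(hostname, resolver)
        if not ok:
            stamp = datetime.now().isoformat(timespec="seconds")
            failures.append((stamp, err))
        time.sleep(interval)
    return failures
```

Something like watch("signal.org") would then give a list of failure timestamps to compare against syslog. One caveat: a caching resolver such as dnsmasq may answer repeated lookups of the same name from its cache, so the probe might not reach the upstream server every time.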

If you control the DNS server yourself then you could monitor that to see if the DNS request arrives or not, and to see if a reply is sent from the DNS server.

You could also try monitoring incoming and outgoing network traffic locally using something like tcpdump or tshark to verify that the DNS request gets sent, and to check if the reply from the DNS server can be seen.
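Alongside a packet capture, a small probe that sends a single DNS query over UDP can help tell "no reply at all" (which points towards packet loss) apart from "the server replied with an error". Below is a minimal sketch under some assumptions: the function names build_query and probe are hypothetical, only plain A-record queries are built, and the reply is only checked for a matching ID and its response code.

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the A record of `hostname`."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, hostname: str, timeout: float = 2.0):
    """Send one UDP query to server:53; return the RCODE, or None if no usable reply."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        try:
            reply, _addr = sock.recvfrom(512)
        except socket.timeout:
            return None
    # Ignore truncated replies and replies whose ID does not match the query.
    if len(reply) < 4 or reply[:2] != query[:2]:
        return None
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F  # low 4 bits of the flags word are the RCODE
```

A return value of 0 means the server answered normally, 2 is SERVFAIL and 5 is REFUSED, while None means nothing usable came back within the timeout. Running the probe against both the local resolver and the upstream server during an outage should show which hop stops answering.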

Anyway those are some ideas, I hope some of it can be useful. Good luck!


TL;DR: Make sure your Pi-Hole isn't rate-limiting your requests.

Today, I finally Googled "pihole rate limit", and lo and behold, this recent blog post mentioned:

...we decided to implement a customizable rate-limiting into FTL itself. It defaults to the rather conservative limit of allowing no more than 1000 queries in a 60 seconds window for each client.

I was beside myself and had completely missed this news. I've opened a feature request with Pi-Hole to get a log entry added for when this happens, hopefully to keep a future home sysadmin from pulling their hair out.

1,000 queries in 60 seconds might sound like a lot, but with 38 active Docker containers (and especially Watchtower and matrix-synapse) that allowance gets used up in a hurry.
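As a back-of-the-envelope check, the quoted limit works out to roughly 16.7 queries per second for the whole server, since the Pi-hole sees it as a single client. The sketch below does that arithmetic; the function names are made up, the 25 queries per second figure in the note is purely illustrative and not measured, and it assumes a steady query rate and a counter that resets at the end of each window, which may not match how FTL actually implements it.

```python
def per_source_budget(limit: int, sources: int) -> float:
    """Queries each source may send per window if the limit is shared evenly."""
    return limit / sources


def seconds_until_limited(limit: int, window_s: float, rate_qps: float):
    """Seconds into a window at which a steady rate hits the limit, or None."""
    if rate_qps * window_s <= limit:
        return None  # the limit is never reached within one window
    return limit / rate_qps
```

With the numbers from this thread, per_source_budget(1000, 38) is about 26 queries per container per minute. At a hypothetical steady 25 queries per second, seconds_until_limited(1000, 60, 25) gives 40.0, which under the assumptions above would leave about 20 seconds of refused lookups before the window resets.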