Temporary failure in name resolution affecting most public traffic

We're having a bizarre issue with one of our domains that has been going on for the past 15+ hours at this point. The domain is not resolvable in many parts of the country. I tried pinging it on a few different Linux servers and got:

example.com: Temporary failure in name resolution

The problem isn't geographically limited though. People all over the country have now told me they can't ping the domain either. However, some people can ping the domain and access it normally.

Finally, somebody pointed me to this site: https://dnschecker.org/ip-blacklist-checker.php

When I try resolving the domain, "Blacklisted?" is No for all of them except for this one: dnsbl.spfbl.net

When I click the error, it says this:

No rDNS was found.

This IP has been flagged because have none valid FCrDNS.

Register a valid rDNS for this IP, which points to the same IP.

The rDNS must be registered under your own domain for you be able to delist it.

I can't think what has caused this or how to fix it. The domain's A records don't point to a static IP as far as I know, since it goes through the Cloudflare load balancers. I thought it might have been an issue with Cloudflare, which we use for our nameservers. However, their portal didn't indicate anything at all yesterday or anything relevant this morning.

The only fix we've found is to change the preferred DNS server manually. For instance, Google and OpenDNS don't resolve the domain. On dnsblacklist.org, 7 out of 12 DNS resolvers fail to resolve it. Cloudflare and 4 other resolvers can successfully resolve it. If we change the preferred DNS server to 1.1.1.1 manually on the servers, now they can ping the domain.

The problem is that all of our public users, obviously, are not going to do this. Any thoughts on what exactly is going on? I haven't changed anything with the domain records at all recently, and this doesn't affect any of my other domains which also use Cloudflare as the nameserver. The issue affects all subdomains on this domain as well as the primary domain, regardless of whether traffic on that address is proxied through Cloudflare.

UPDATE:

[Two screenshots, not preserved here: one for another domain and one for the problematic domain.]


Solution 1:

If you go to https://dnsviz.net/d/phreaknet.org/X44olg/dnssec/ you will find, right now, no fewer than 6 bogus errors and 3 errors on your domain. In other words, it is horribly broken at the DNS layer:

DNSViz report for phreaknet.org on 2020-10-20 00:00:22 UTC

I don't know which services you used to check, but obviously they were oblivious to DNSSEC problems, hence they are not good. Use dnsviz.net, it is tried and trusted.

As said in comments, DNSSEC is often a source of "it works there but not there" because if it is broken it will be seen as broken only by validating resolvers, which are not all resolvers out there. Plus there are many edge cases in DNSSEC that can make one resolver accept the answer and not the other.

You will need to fix all of these problems.

Start by completely removing DNSSEC from the domain, unless you are sure you have mastered it. That means going to its current sponsoring registrar (Namecheap, as seen in whois) and finding the way to remove the DS records at the registry for your domain.

Once this is done you will need to wait. How much?

.ORG authoritative nameservers publish your DS record with a TTL of one day:

$ dig org. NS +short
b2.org.afilias-nst.org.
b0.org.afilias-nst.org.
a2.org.afilias-nst.info.
a0.org.afilias-nst.info.
d0.org.afilias-nst.org.
c0.org.afilias-nst.info.
$ dig @d0.org.afilias-nst.org. DS phreaknet.org +noall +ans
phreaknet.org.      1d IN DS 2371 13 2 (
                1AE4B3ECD5282A0A412E6792B60855C39D9166F94703
                980CCECB740EAAD501A9 )

So you need to wait at least one day after the moment this record disappears (which can happen some time after you ask your registrar to remove it).

Once the DS record has been gone for at least 1 day, do a DNSViz check again and see if you have other problems. Removing the DS record alone will probably have solved all your problems already.
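To know when that one-day clock starts, you can ask a parent-zone nameserver directly whether it still publishes the DS record. The following is only a rough sketch: the function names `ds_status` and `check_ds` are made up for illustration, and it assumes `dig` is installed.

```shell
#!/bin/sh
# Sketch: is a DS record still published by a parent-zone nameserver?
# ds_status and check_ds are illustrative names, not part of any tool.

# Reads the output of `dig ... DS +short` on stdin.
# DS rdata starts with the numeric key tag, so any line starting with a
# digit counts as a published DS record.
ds_status() {
    if grep -q '^[0-9]'; then
        echo present
    else
        echo absent
    fi
}

# Queries one parent nameserver directly (needs network access).
# $1 = domain, $2 = parent-zone nameserver
check_ds() {
    dig "@$2" DS "$1" +short +norecurse | ds_status
}
```

For example, `check_ds phreaknet.org d0.org.afilias-nst.org.` should print `present` for as long as the record shown above is still served. A timeout also yields `absent`, so treat the result as a hint and repeat the check against the other .ORG nameservers before starting to count the day.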

Alternatively, you should ask your DNS provider to help with DNS-related problems on your domain, especially since the core problem is: "RRSIG phreaknet.org/DNSKEY alg 13, id 2371: The Signature Expiration field of the RRSIG RR (2020-10-18 21:55:18+00:00) is 1 day in the past." This points to an error either by your DNS provider, which did not refresh the signatures on your domain, or by you, if your DNS provider instructed you to change the DS record at the registry to use another key and you didn't do it. The recent expiration of this signature explains why you started to get reports of problems around one day ago.
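You can spot an expired signature yourself without DNSViz. dig prints RRSIG expiration times as UTC timestamps in YYYYMMDDHHMMSS form, so they can be compared numerically against the current UTC time. A minimal sketch follows; the function names `rrsig_expirations` and `rrsig_state` are hypothetical, and it assumes dig's default one-record-per-line output.

```shell
#!/bin/sh
# Sketch: flag expired RRSIG records. Function names are illustrative.

# Reads dig answer lines on stdin and prints the expiration field of
# every RRSIG record. Field layout assumed (dig's default output):
# name ttl class RRSIG covered alg labels origttl expiration inception ...
rrsig_expirations() {
    awk '$4 == "RRSIG" { print $9 }'
}

# rrsig_state EXPIRATION [NOW]
# Both values are UTC timestamps in YYYYMMDDHHMMSS form; NOW defaults to
# the current time. Fixed-width timestamps compare correctly as integers.
rrsig_state() {
    expiration=$1
    now=${2:-$(date -u +%Y%m%d%H%M%S)}
    if [ "$expiration" -lt "$now" ]; then
        echo expired
    else
        echo valid
    fi
}
```

With network access, something like `dig phreaknet.org DNSKEY +dnssec +cd +noall +answer | rrsig_expirations` would list the timestamps to feed into `rrsig_state`. This only looks at expiration; a signature whose inception time is in the future is equally invalid and is not covered here.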

If your DNS provider can regenerate valid signatures, with the current key, on all records, then things will work again without having to remove the DS at the registry. You will still need to wait, albeit far less, as the signatures' TTLs seem to be either 1 hour or 5 minutes:

$ dig phreaknet.org SOA +dnssec +cd +noall +ans
phreaknet.org.      1h IN SOA donna.ns.cloudflare.com. dns.cloudflare.com. (
                2035213272 ; serial
                10000      ; refresh (2 hours 46 minutes 40 seconds)
                2400       ; retry (40 minutes)
                604800     ; expire (1 week)
                3600       ; minimum (1 hour)
                )
phreaknet.org.      1h IN RRSIG SOA 13 2 3600 (
                20201021011353 20201018231353 34505 phreaknet.org.
                bxVvIlM7M5tlKDLsRUSj2LSpNrvkv4DlZnzi+AfKFkP7
                1GuzfyOJmNlpQK+eD8j+kigLcBXUEkbO66rY76QWtA== )
$ dig phreaknet.org A +dnssec +cd +noall +ans
phreaknet.org.      5m IN A 104.18.57.190
phreaknet.org.      5m IN A 104.18.56.190
phreaknet.org.      5m IN A 172.67.175.181
phreaknet.org.      5m IN RRSIG A 13 2 300 (
                20201021011314 20201018231314 34505 phreaknet.org.
                BCZjG6JZfJqERRBOZ9DP50OBXzUhCdQ777EoElWv52oU
                xRRWw+I1e/ok3a5V2h1gt/OBLVVwlRFQhALk2rclSA== )

A useful aside about the commands above:

  • you need +dnssec to see the RRSIG records and hence their TTL (they are not shown by default because they are rarely useful on their own, except when troubleshooting DNSSEC-related problems)
  • you absolutely need +cd, the flag that asks the resolver NOT to do DNSSEC validation; otherwise those queries will probably just get back SERVFAIL because the DNSSEC configuration is broken. +cd is the most important dig flag for DNSSEC: "do a query without it and get SERVFAIL, then redo the same query with it and get a reply" means in 99.999% of cases that your problem is solely related to the DNSSEC configuration on your domain.
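
The "with and without +cd" comparison can be scripted roughly as follows. This is a sketch under assumptions: the function names are made up, the resolver you pass in must be a validating one, and the verdict strings are only my wording.

```shell
#!/bin/sh
# Sketch: compare a query with and without +cd. Names are illustrative.

# Reads full dig output on stdin and prints the status from the header
# line (for example NOERROR or SERVFAIL).
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# dnssec_verdict STATUS_WITHOUT_CD STATUS_WITH_CD
dnssec_verdict() {
    if [ "$1" = "SERVFAIL" ] && [ "$2" = "NOERROR" ]; then
        echo "likely DNSSEC validation failure"
    elif [ "$1" = "$2" ]; then
        echo "no difference with +cd"
    else
        echo "inconclusive"
    fi
}

# Runs both queries against one resolver (needs network access).
# $1 = name to resolve, $2 = validating resolver address
compare_cd() {
    plain=$(dig "@$2" "$1" A | dig_status)
    with_cd=$(dig "@$2" "$1" A +cd | dig_status)
    dnssec_verdict "$plain" "$with_cd"
}
```

For instance, `compare_cd phreaknet.org 8.8.8.8` runs the two queries against Google's resolver, which the question reports as failing to resolve the domain.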