Fully qualified domain name validation
Is there a quick and dirty way to validate if the correct FQDN has been entered? Keep in mind there is no DNS server or Internet connection, so validation has to be done via regex/awk/sed.
Any ideas?
Solution 1:
(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)
Regex is always going to be at best an approximation for things like this, and the rules change over time. The above regex was written with the following in mind and is specific to hostnames:
Hostnames are composed of a series of labels concatenated with dots. Each label is 1 to 63 characters long, and may contain:
- the ASCII letters a-z (in a case insensitive manner),
- the digits 0-9,
- and the hyphen ('-').
Additionally:
- labels cannot start or end with hyphens (RFC 952)
- labels can start with numbers (RFC 1123)
- max length of ascii hostname including dots is 253 characters (not counting trailing dot) (http://blogs.msdn.com/b/oldnewthing/archive/2012/04/12/10292868.aspx)
- underscores are not allowed in hostnames (but are allowed in other DNS types)
some assumptions:
- TLD is at least 2 characters and only a-z
- we want at least 1 level above TLD
results: valid / invalid
- 911.gov - valid
- 911 - invalid (no TLD)
- a-.com - invalid
- -a.com - invalid
- a.com - valid
- a.66 - invalid
- my_host.com - invalid (underscore)
- typical-hostname33.whatever.co.uk - valid
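The results above can be reproduced with a few lines of Python, whose re module supports both the lookahead and the fixed-width lookbehind used in the pattern. This is only a minimal sketch; the helper name is_valid_hostname is made up for illustration.

```python
import re

# The regex from this answer, unchanged.
HOSTNAME_RE = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)


def is_valid_hostname(name: str) -> bool:
    """Return True if name matches the hostname regex (hypothetical helper)."""
    # fullmatch is used because "$" alone would also accept a trailing newline.
    return HOSTNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    samples = [
        "911.gov",
        "911",
        "a-.com",
        "-a.com",
        "a.com",
        "a.66",
        "my_host.com",
        "typical-hostname33.whatever.co.uk",
    ]
    for sample in samples:
        verdict = "valid" if is_valid_hostname(sample) else "invalid"
        print(f"{sample} - {verdict}")
```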
EDIT: John Rix provided an alternative hack of the regex to make the specification of a TLD optional:
(?=^.{1,253}$)(^(((?!-)[a-zA-Z0-9-]{1,63}(?<!-))|((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63})$)
- 911 - valid
- 911.gov - valid
EDIT 2:
Someone asked for a version that works in JS.
The original doesn't work in older JS engines because they do not support regex lookbehind (it was only added in ES2018),
specifically the code (?<!-)
- which specifies that the previous character cannot be a hyphen.
Anyway, here it is rewritten without the lookbehind - a little uglier but not much:
(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)
you could likewise make a similar replacement on John Rix's version.
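Python's re module accepts the lookbehind-free syntax as well, so the rewrite can be sanity-checked there. The sketch below assumes that "a similar replacement" on John Rix's version means swapping each `[a-zA-Z0-9-]{1,63}(?<!-)` for `[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]`; that second pattern is derived by hand and does not appear in the original answer. The helper name matches is hypothetical.

```python
import re

# Lookbehind-free pattern from EDIT 2, unchanged.
NO_LOOKBEHIND_RE = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)"
)

# Assumption: John Rix's optional-TLD pattern with the same substitution
# applied by hand. This variant is not part of the original answer.
NO_LOOKBEHIND_OPTIONAL_TLD_RE = re.compile(
    r"(?=^.{1,253}$)"
    r"(^(((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9])"
    r"|((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63})$)"
)


def matches(pattern: re.Pattern, name: str) -> bool:
    """Return True if the whole string matches the given compiled pattern."""
    return pattern.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["911.gov", "911", "a-.com", "-a.com", "a.com"]:
        print(
            name,
            matches(NO_LOOKBEHIND_RE, name),
            matches(NO_LOOKBEHIND_OPTIONAL_TLD_RE, name),
        )
```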
EDIT 3: if you want to allow trailing dots - which is technically allowed:
(?=^.{4,253}\.?$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}\.?$)
I wasn't familiar with trailing dot syntax till @ChaimKut pointed it out and I did some research:
- http://dns-sd.org./TrailingDotsInDomainNames.html
- https://jdebp.eu./FGA/web-fully-qualified-domain-name.html
Using trailing dots, however, seems to cause somewhat unpredictable results in the various tools I played with, so I would advise some caution.
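One cautious way to deal with this is to accept the trailing dot on input and strip it before handing the name to other tools. The Python sketch below uses the EDIT 3 pattern as written; the helper normalize_fqdn and the choice to strip the dot are assumptions for illustration, not something the answer prescribes.

```python
import re
from typing import Optional

# Trailing-dot pattern from EDIT 3, unchanged.
TRAILING_DOT_RE = re.compile(
    r"(?=^.{4,253}\.?$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}\.?$)"
)


def normalize_fqdn(name: str) -> Optional[str]:
    """Return name without its single trailing dot, or None if it is invalid."""
    if TRAILING_DOT_RE.fullmatch(name) is None:
        return None
    # Strip exactly one trailing dot; the regex has already rejected "..".
    return name[:-1] if name.endswith(".") else name


if __name__ == "__main__":
    for name in ["example.com", "example.com.", "example.com.."]:
        print(repr(name), "->", normalize_fqdn(name))
```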
Solution 2:
It's harder nowadays, with internationalized domain names and several thousand (!) new TLDs.
The easy part is that you can still split the components on ".".
You need a list of registerable TLDs. There's a site for that:
https://publicsuffix.org/list/effective_tld_names.dat
You only need to check the ICANN-recognized ones. Note that a registerable TLD can have more than one component, such as "co.uk".
Then there's IDN and punycode. Domains are Unicode now. For example,
"xn--nnx388a" is equivalent to "臺灣". Both of those are valid TLDs, incidentally.
For punycode conversion code, see "http://golang.org/src/pkg/net/http/cookiejar/punycode.go".
Checking the syntax of each domain component has new rules, too. See RFC5890 at https://www.rfc-editor.org/rfc/rfc5890
Components can be either ASCII labels or Unicode labels (U-labels). ASCII labels either follow the old syntax, or begin "xn--", in which case they are a punycode version (an A-label) of a Unicode string.
The rules for Unicode are very complex, and are given in RFC5890. The rules are designed to prevent such things as mixing characters from left-to-right and right-to-left sets.
Sorry there's no easy answer.
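To make the steps concrete, here is a rough Python sketch: convert the name to ASCII, check each label against the old syntax, then look for the longest known suffix. Several things are assumptions: the function names are made up, SAMPLE_PSL is a tiny hand-written excerpt in the list's format rather than the real file, wildcard and exception rules are skipped, and Python's built-in "idna" codec implements the older IDNA 2003 rules (RFC 3490), not the RFC 5890 family. Labels starting with "xn--" are only checked syntactically, not decoded.

```python
import re

# Old-style ASCII label: letters, digits, hyphens; no leading/trailing hyphen.
LDH_LABEL_RE = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")

# Hand-written excerpt in Public Suffix List format, for illustration only.
SAMPLE_PSL = """\
// ===BEGIN ICANN DOMAINS===
com
uk
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.com
// ===END PRIVATE DOMAINS===
"""


def to_ascii(name: str) -> str:
    """Convert a possibly Unicode name to lowercase ASCII (IDNA 2003 codec)."""
    return name.encode("idna").decode("ascii").lower()


def load_icann_suffixes(psl_text: str) -> set[str]:
    """Collect plain rules from the ICANN section of a PSL-format text."""
    suffixes = set()
    in_icann = False
    for line in psl_text.splitlines():
        line = line.strip()
        if line.startswith("// ===BEGIN ICANN DOMAINS==="):
            in_icann = True
        elif line.startswith("// ===END ICANN DOMAINS==="):
            in_icann = False
        elif in_icann and line and not line.startswith(("//", "*", "!")):
            try:
                suffixes.add(to_ascii(line))
            except UnicodeError:
                continue  # skip rules this codec cannot encode
    return suffixes


def is_registerable_name(name: str, suffixes: set[str]) -> bool:
    """True if name is syntactically valid and sits under a known suffix."""
    try:
        ascii_name = to_ascii(name)
    except UnicodeError:
        return False  # empty label, overlong label, or unencodable text
    if len(ascii_name) > 253:
        return False
    labels = ascii_name.split(".")
    if not all(LDH_LABEL_RE.fullmatch(label) for label in labels):
        return False
    # The first hit scanning left to right is the longest matching suffix;
    # at least one label must remain in front of it.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return i >= 1
    return False
```

For example, with suffixes = load_icann_suffixes(SAMPLE_PSL), the name "example.co.uk" is accepted while "co.uk" on its own is rejected because it is itself a suffix.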
Solution 3:
This regex is what you want:
(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)
It matches your example domains (groupa-zone1appserver.example.com, cod.eu, etc.).
I'll try to explain:
(?=^.{1,254}$)
a lookahead requiring the whole name (which can begin with any character) to be between 1 and 254 characters long; it could also be {5,254} if we assume co.uk is the minimum length.
(^
start of the match
(?:
opens a non-capturing group, which matches one domain level
(?!\d+\.)
a domain level must not consist only of digits, so 1234.co.uk or abc.123.uk aren't accepted, while 1a.ko.uk is.
[a-zA-Z0-9_\-]
each domain level may contain only the characters a-zA-Z0-9_-
{1,63}
the length of any domain level is at most 63 characters (it could be {2,63})
\.?
each domain level may be followed by a dot
)+
the group repeats one or more times, once per domain level, and is followed by
(?:[a-zA-Z]{2,})$)
the final part of the domain name, which must not be followed by anything else and must consist of at least 2 characters from a-zA-Z
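The behaviour described above can be checked in Python, which accepts this pattern as written. This is a small sketch with a made-up helper name; note that, because underscores and hyphens are in the character class without further restrictions, names such as my_host.com are accepted by this pattern.

```python
import re

# The regex from this answer, unchanged.
DOMAIN_RE = re.compile(
    r"(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)"
)


def is_valid_domain(name: str) -> bool:
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return DOMAIN_RE.fullmatch(name) is not None


if __name__ == "__main__":
    samples = [
        "groupa-zone1appserver.example.com",
        "cod.eu",
        "1a.ko.uk",
        "1234.co.uk",
        "abc.123.uk",
        "my_host.com",
    ]
    for sample in samples:
        print(sample, is_valid_domain(sample))
```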
Solution 4:
We use this regex to validate domains which occur in the wild. It covers all practical use cases I know of. New ones are welcome. According to our guidelines it avoids capturing groups and greedy matching.
^(?!.*?_.*?)(?!(?:[\w]+?\.)?\-[\w\.\-]*?)(?![\w]+?\-\.(?:[\w\.\-]+?))(?=[\w])(?=[\w\.\-]*?\.+[\w\.\-]*?)(?![\w\.\-]{254})(?!(?:\.?[\w\-\.]*?[\w\-]{64,}\.)+?)[\w\.\-]+?(?<![\w\-\.]*?\.[\d]+?)(?<=[\w\-]{2,})(?<![\w\-]{25})$
Proof and explanation: https://regex101.com/r/FLA9Bv/40
There are two approaches to choose from when validating domains.
By-the-books FQDN matching (theoretical definition, rarely encountered in practice):
- max 253 characters long (as per RFC-1035/3.1, RFC-2181/11)
- max 63 characters long per label (as per RFC-1035/3.1, RFC-2181/11)
- any characters are allowed (as per RFC-2181/11)
- TLDs cannot be all-numeric (as per RFC-3696/2)
- FQDNs can be written in a complete form, which includes the root zone (the trailing dot)
Practical / conservative FQDN matching (practical definition, expected and supported in practice):
- by-the-books matching with the following exceptions/additions
- valid characters: [a-zA-Z0-9.-]
- labels cannot start or end with hyphens (as per RFC-952 and RFC-1123/2.1)
- TLD min length is 2 characters, max length is 24 characters, as per currently existing records
- don't match trailing dot
The regex above contains both by-the-books and practical rules.
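The regex relies on variable-length lookbehind, which some engines (Python's re module, for one) reject. Where that is a problem, the practical rules listed above can be written out step by step instead. The sketch below is an assumption-laden translation of the list, not an exact equivalent of the regex; the function name is made up, and the requirement of at least one dot is taken from the regex rather than from the list.

```python
import re

# Practical character set from the list above.
ALLOWED_CHARS_RE = re.compile(r"[a-zA-Z0-9.-]+")


def is_practical_fqdn(name: str) -> bool:
    """Check a name against the practical/conservative rules (hypothetical helper)."""
    # Whole name: 1 to 253 characters, restricted character set.
    if not 1 <= len(name) <= 253:
        return False
    if ALLOWED_CHARS_RE.fullmatch(name) is None:
        return False
    # Practical rule: do not match a trailing dot.
    if name.endswith("."):
        return False
    labels = name.split(".")
    # Assumption taken from the regex: the name contains at least one dot.
    if len(labels) < 2:
        return False
    for label in labels:
        # Each label: 1 to 63 characters, no leading or trailing hyphen.
        if not 1 <= len(label) <= 63:
            return False
        if label.startswith("-") or label.endswith("-"):
            return False
    tld = labels[-1]
    # TLD: not all-numeric, 2 to 24 characters long.
    if tld.isdigit():
        return False
    return 2 <= len(tld) <= 24


if __name__ == "__main__":
    for name in ["example.com", "example.com.", "-a.com", "a.66", "localhost"]:
        print(name, is_practical_fqdn(name))
```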
Solution 5:
CONSIDERATION #1:
Please note that due to relaxed requirements in RFC-2181 DNS labels can consist of pretty much any combination of symbols (however, the length restrictions are still there):
"Any binary string whatever can be used as the label of any resource record. Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs." (https://www.rfc-editor.org/rfc/rfc2181#section-11)
CONSIDERATION #2:
"There is an additional rule that essentially requires that top-level domain names not be all-numeric" (https://www.rfc-editor.org/rfc/rfc3696#section-2)
Taking into account these two considerations, the correct regex looks like this:
/^(?!:\/\/)(?=.{1,255}$)((.{1,63}\.){1,127}(?![0-9]*$)[a-z0-9-]+\.?)$/i
See demo @ http://regexr.com/3g5j0
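For a quick offline check without the demo site, the same pattern can be run in Python. In this sketch the JavaScript-style /.../i delimiters are replaced by re.IGNORECASE and the escaped slashes are written as plain "//"; the helper name is made up. Because labels are matched with ".", characters such as underscores are accepted, in line with consideration #1.

```python
import re

# The answer's pattern, adapted from /.../i literal syntax to Python.
FQDN_RE = re.compile(
    r"^(?!://)(?=.{1,255}$)((.{1,63}\.){1,127}(?![0-9]*$)[a-z0-9-]+\.?)$",
    re.IGNORECASE,
)


def is_fqdn(name: str) -> bool:
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return FQDN_RE.fullmatch(name) is not None


if __name__ == "__main__":
    samples = [
        "example.com",
        "EXAMPLE.COM.",
        "my_host.com",
        "a.66",
        "localhost",
        "://example.com",
    ]
    for sample in samples:
        print(sample, is_fqdn(sample))
```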