Can I improve this regex check for valid domain names?
Solution 1:
Please, please, please don't use a fixed and horribly complicated regex like this to match known domain names.
The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!
Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
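To give an idea of what parsing that list could look like, here is a rough sketch. It assumes the list format as I understand it: one rule per line, `//` starting a comment, `*.` marking a wildcard and `!` marking an exception. The names `parse_rules` and `listed_suffix` and the sample rules are made up for illustration. Downloading is left out; you would pass in the lines of the fetched file instead of `sample`.

```python
def parse_rules(lines):
    """Split suffix-list lines into (rules, exceptions) sets of lowercase strings."""
    rules, exceptions = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith('//'):
            continue  # skip blank lines and comments
        rule = line.split()[0].lower()  # a rule ends at the first whitespace
        if rule.startswith('!'):
            exceptions.add(rule[1:])
        else:
            rules.add(rule)
    return rules, exceptions


def listed_suffix(domain, rules, exceptions):
    """Return the longest listed suffix of domain, or None if nothing matches.

    Assumption: returning None (rather than falling back to a catch-all rule)
    is a choice made here so a validator can reject unknown TLDs.
    """
    labels = domain.lower().rstrip('.').split('.')
    candidates = ['.'.join(labels[i:]) for i in range(len(labels))]
    # Exception rules win: the suffix is the rule minus its leftmost label.
    for candidate in candidates:
        if candidate in exceptions:
            return candidate.partition('.')[2]
    # Otherwise take the longest candidate matched exactly or by a wildcard.
    for candidate in candidates:
        wildcard = '*.' + candidate.partition('.')[2]
        if candidate in rules or wildcard in rules:
            return candidate
    return None


# Illustrative excerpt only, not the real list.
sample = """
// comment line
com
uk
co.uk
*.ck
!www.ck
""".splitlines()

rules, exceptions = parse_rules(sample)
print(listed_suffix('www.example.co.uk', rules, exceptions))  # co.uk
print(listed_suffix('omnom.nom', rules, exceptions))          # None
```

A domain whose suffix comes back as `None` has no entry in the list, which is the case a hard-coded regex would get wrong as soon as the list changes.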
Solution 2:
Download this list and save it as domains.txt: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Example usage (in Python):
import re

def validate(domain):
    # Build one alternation from the TLD list, skipping comments and blank lines.
    with open('domains.txt') as f:
        valid_domains = [re.escape(line.strip().upper())
                         for line in f
                         if line.strip() and not line.startswith('#')]
    # Matches a single label followed by a listed TLD, e.g. NAME.COM.
    r = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(valid_domains))
    return bool(r.match(domain.upper()))

print(validate('stackoverflow.com'))
print(validate('omnom.nom'))
You can factor the domain-list-building out of the validate function to help performance.
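One way that refactoring might look, as a sketch rather than the only option: build the pattern once and hand back a function that reuses it. `build_validator` and the `sample` list are names invented here; the regex is the same as in the example above.

```python
import re


def build_validator(lines):
    """Compile the TLD pattern once and return a validate(domain) function."""
    tlds = [re.escape(line.strip().upper())
            for line in lines
            if line.strip() and not line.startswith('#')]
    pattern = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(tlds))

    def validate(domain):
        # Reuses the already compiled pattern on every call.
        return bool(pattern.match(domain.upper()))

    return validate


# Stand-in for the downloaded file; in practice you would pass the open
# file instead, e.g. build_validator(open('domains.txt')).
sample = ['# sample header line', 'COM', 'NET', 'ORG']
validate = build_validator(sample)

print(validate('stackoverflow.com'))  # True
print(validate('omnom.nom'))          # False
```

The file is then read and the regex compiled a single time, instead of on every call to validate.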