Get just the domain name from a URL?
Yes, it is possible. Use:
Uri.GetLeftPart( UriPartial.Authority )
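To make it concrete what that call returns, here is a minimal sketch. The class name UriParts and its two helper methods are made up for illustration; only Uri, GetLeftPart, UriPartial.Authority and Host come from the framework. Note that GetLeftPart includes the scheme, while Host gives the bare host name; neither strips subdomains.

```csharp
using System;

// Hypothetical helper class, for illustration only.
internal static class UriParts
{
    // Scheme plus authority, e.g. "http://www.beta.microsoft.com".
    public static string SchemeAndAuthority(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Host name only, without scheme or path, e.g. "www.beta.microsoft.com".
    public static string HostOnly(string url)
    {
        return new Uri(url).Host;
    }
}
```

For "http://www.beta.microsoft.com/path/page.htm", SchemeAndAuthority gives "http://www.beta.microsoft.com" and HostOnly gives "www.beta.microsoft.com".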
@Dewfy: flaw is that your method returns "uk" for "www.test.co.uk" but the domain here is clearly "test.co.uk".
@naivists: flaw is that your method returns "beta.microsoft.com" for "www.beta.microsoft.com" but the domain here is clearly "microsoft.com"
I needed the same, so I wrote a class that you can copy and paste into your solution. It uses a hard-coded string array of TLDs. http://pastebin.com/raw.php?i=VY3DCNhp
Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.com/path/page.htm"));
outputs microsoft.com
and
Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.co.uk/path/page.htm"));
outputs microsoft.co.uk
I tried pretty much every approach but all of them fell short of the desired result. So here is my approach adjusted from servermanfail.
The TLD file is available at https://publicsuffix.org/list/. I took the file from https://publicsuffix.org/list/effective_tld_names.dat, parse it, and search it for the TLDs. If new TLDs are published, just download the latest file.
Have fun.
using System;
using System.Collections.Generic;
using System.IO;

namespace SearchWebsite
{
    internal class NetDomain
    {
        static public string GetDomainFromUrl(string Url)
        {
            return GetDomainFromUrl(new Uri(Url));
        }

        static public string GetDomainFromUrl(string Url, bool Strict)
        {
            return GetDomainFromUrl(new Uri(Url), Strict);
        }

        static public string GetDomainFromUrl(Uri Url)
        {
            return GetDomainFromUrl(Url, false);
        }

        // Note: the Strict parameter is currently not used.
        static public string GetDomainFromUrl(Uri Url, bool Strict)
        {
            if (Url == null) return null;
            initializeTLD();
            var dotBits = Url.Host.Split('.');
            if (dotBits.Length == 1) return Url.Host; //eg http://localhost/blah.php = "localhost"
            if (dotBits.Length == 2) return Url.Host; //eg http://blah.co/blah.php = "blah.co"

            // Find the longest suffix from the list that the host ends with.
            string bestMatch = "";
            foreach (var tld in DOMAINS)
            {
                // Match on a label boundary so "notco.uk" does not match "co.uk",
                // and so the host always keeps one label in front of the suffix.
                if (Url.Host.EndsWith("." + tld, StringComparison.OrdinalIgnoreCase))
                {
                    if (tld.Length > bestMatch.Length) bestMatch = tld;
                }
            }
            if (string.IsNullOrEmpty(bestMatch))
                return Url.Host; //no known suffix matched, return the host unchanged

            //add the domain name onto tld
            string[] bestBits = bestMatch.Split('.');
            string[] inputBits = Url.Host.Split('.');
            int getLastBits = bestBits.Length + 1;
            bestMatch = "";
            for (int c = inputBits.Length - getLastBits; c < inputBits.Length; c++)
            {
                if (bestMatch.Length > 0) bestMatch += ".";
                bestMatch += inputBits[c];
            }
            return bestMatch;
        }

        // Loads the suffix list once. Expects effective_tld_names.dat in the working directory.
        // Wildcard ("*.ck") and exception ("!www.ck") rules are stored verbatim and are not interpreted.
        static private void initializeTLD()
        {
            if (DOMAINS.Count > 0) return;
            using (StreamReader reader = File.OpenText("effective_tld_names.dat"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (line.Length > 0 && !line.StartsWith("//", StringComparison.Ordinal))
                    {
                        DOMAINS.Add(line);
                    }
                }
            }
        }

        // This file was taken from https://publicsuffix.org/list/effective_tld_names.dat
        static public List<String> DOMAINS = new List<String>();
    }
}
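As a variation on the class above (not the same method), the lookup can also be done label by label against a HashSet instead of scanning the whole list with EndsWith. This is only a sketch: the class name SuffixLookup, the method name, and the three-entry suffix set are made up for illustration and stand in for the parsed file. It does not handle the wildcard or exception rules in the list either.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: label-wise lookup against a set of public suffixes.
internal static class SuffixLookup
{
    // Assumed stand-in for the parsed list; the real file has far more entries.
    private static readonly HashSet<string> Suffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "com", "uk", "co.uk" };

    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.Split('.');
        // Drop one leading label at a time, so the longest candidate suffix is tried first.
        for (int i = 1; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (Suffixes.Contains(candidate))
            {
                // Keep exactly one label in front of the matched suffix.
                return string.Join(".", labels, i - 1, labels.Length - i + 1);
            }
        }
        return host; // no known suffix found: return the host unchanged
    }
}
```

With this small set, "www.beta.microsoft.co.uk" gives "microsoft.co.uk", "www.beta.microsoft.com" gives "microsoft.com", and "localhost" comes back unchanged. Because candidates are built from whole labels, a host like "notco.uk" is not mistaken for a "co.uk" match.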
google.com is not guaranteed to be the same as www.google.com (well, for this example it technically is, but may be otherwise).
Maybe what you actually need is to remove the "top level" domain and the "www" subdomain? Then just split('.')
and take the part before the last part!
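A minimal sketch of that split idea, assuming you already have the host name (for example from Uri.Host). The class and method names are made up for illustration. It also shows the limitation raised in the comments above: multi-part suffixes such as "co.uk" break it.

```csharp
using System;

// Hypothetical helper, for illustration only.
internal static class NaiveDomain
{
    // Returns the label just before the last one, e.g. "google" for "www.google.com".
    public static string NameBeforeLastLabel(string host)
    {
        string[] parts = host.Split('.');
        if (parts.Length < 2) return host; // e.g. "localhost"
        return parts[parts.Length - 2];
    }
}
```

NameBeforeLastLabel("www.google.com") gives "google", but NameBeforeLastLabel("www.test.co.uk") gives "co", so this only works when the suffix is a single label.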