Extract the top-level domain and the second-level domain from a URL
I'd like to extract the top-level domain and the second-level domain from a URL like "https://apple.stackexchange.com/questions/ask".
Example URLs, each followed by the desired result, are below.
https://apple.stackexchange.com/questions/ask
stackexchange.com
https://www.nytimes.com/2019/07/16/science/5g-cellphones-wireless-cancer.html
nytimes.com
https://nextdoor.com/news_feed/?post=117602&ct=-A17-ghvVOF0tfn9vptW_5a7JOBEyP4w6_hJAZUnMQqN56952&ec=OWKiQRDj9vEHefhwfGYAE0s%3D&lc=1002&is=tpe
nextdoor.com
https://www.amazon.com/gp/product/B007B60SCG/ref=ox_sc_act_title_1?smid=ATVPDKIKX0DER&psc=1
amazon.com
http://www.verizon.net/index.php
verizon.net
I'm ignoring multi-tier domains (where the suffix itself has more than one part). I'd prefer to use Bash on macOS.
There are lots of pages on getting the full domain name:
- Extract domain name from URL using bash shell parameter substitution
  https://www.cyberciti.biz/faq/get-extract-domain-name-from-url-in-linux-unix-bash/
- echo http://example.com/index.php | awk -F[/:] '{print $4}'
  https://stackoverflow.com/a/11385736/1360075
I do not need this level of perfection:
https://github.com/john-kurkowski/tldextract
As you are already using awk and are looking for a simple solution:
awk -F/ '{n=split($3, a, "."); printf("%s.%s", a[n-1], a[n])}' <<< 'http://www.example.com/index.php'
    ^   ^ ^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^
    |   | |                                    |
    |   | |                                    last two elements
    |   | |
    |   | +--- Split the 3rd field (aka the part after //) into
    |   |      the array 'a', using '.' as the separator for splitting.
    |   |      Returns the number of created array elements in 'n'.
    |   |
    |   +----- The awk code between the '' gets run once for every
    |          input line, with the fields split by -F/ stored in
    |          $1, $2 etc. In our case $1 contains "http:", $2 is
    |          empty, $3 contains "www.example.com" and $4 etc. the
    |          various path elements (if there are any)
    |
    +--------- Split the input lines into fields, separated by '/'
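The one-liner above can be wrapped in a small function and run against the URLs from the question. This is only a sketch: the function name sld_tld is made up for illustration, printf piped into awk is used in place of <<< so that it also runs under plain sh, and a guard is added for hosts with fewer than two labels. It still ignores multi-tier suffixes and does not strip a port or user info from the host.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer above): print the last two
# dot-separated labels of the host part of a URL.
sld_tld() {
    printf '%s\n' "$1" | awk -F/ '{
        n = split($3, a, ".")        # $3 is the host, e.g. "www.example.com"
        if (n >= 2)
            print a[n-1] "." a[n]    # second-level domain + top-level domain
        else
            print $3                 # single-label host: print it unchanged
    }'
}

sld_tld 'https://apple.stackexchange.com/questions/ask'
sld_tld 'https://www.nytimes.com/2019/07/16/science/5g-cellphones-wireless-cancer.html'
sld_tld 'http://www.verizon.net/index.php'
```

For the three calls shown, this should print stackexchange.com, nytimes.com and verizon.net, one per line.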
Parsing URLs with Bash
The following questions should provide a good starting point:
- Parse URL in shell script
- Parse below URL in bash
@pjz's answer breaks apart a URL into more manageable parts:
#!/bin/sh
INPUT_URL="https://www.amazon.com/gp/product/B007B60SCG/ref=ox_sc_act_title_1?smid=ATVPDKIKX0DER&psc=1"
# extract the protocol (empty if the URL has no "://")
proto="$(echo "$INPUT_URL" | grep '://' | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol (a no-op when $proto is empty)
url="${INPUT_URL#"$proto"}"
# extract the user and password (if any)
userpass="$(echo "$url" | grep @ | cut -d@ -f1)"
pass="$(echo "$userpass" | grep : | cut -d: -f2)"
if [ -n "$pass" ]; then
    user="$(echo "$userpass" | grep : | cut -d: -f1)"
else
    user="$userpass"
fi
# extract the host and port; quoting keeps "?" and "&" in the URL literal
hostport="$(echo "$url" | sed -e "s,$userpass@,,g" | cut -d/ -f1)"
port="$(echo "$hostport" | grep : | cut -d: -f2)"
if [ -n "$port" ]; then
    host="$(echo "$hostport" | grep : | cut -d: -f1)"
else
    host="$hostport"
fi
# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"
echo "$hostport"
Given $hostport, you should now be able to strip it back to the domain as desired.
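One possible way to do that last step uses only shell parameter expansion. This is a sketch rather than part of @pjz's answer: the function name strip_to_domain is hypothetical, it assumes the input looks like host or host:port, and it does not handle IPv6 literals or multi-tier suffixes.

```shell
#!/bin/sh
# Hypothetical helper: reduce "host[:port]" to its last two labels.
strip_to_domain() {
    host="${1%%:*}"                  # drop ":port" if present
    case "$host" in
        *.*)
            tld="${host##*.}"        # text after the last dot, e.g. "com"
            rest="${host%.*}"        # everything before it, e.g. "www.amazon"
            printf '%s.%s\n' "${rest##*.}" "$tld"
            ;;
        *)
            printf '%s\n' "$host"    # no dot at all: leave the host as-is
            ;;
    esac
}

strip_to_domain "www.amazon.com"
strip_to_domain "apple.stackexchange.com:8080"
```

With the two sample inputs this should print amazon.com and stackexchange.com. In the script above, the call would be strip_to_domain "$hostport".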