Extract the top-level domain and the second-level domain from a URL

I'd like to extract the top-level domain and the second-level domain from a URL like "https://apple.stackexchange.com/questions/ask".

Example URLs, each followed by the desired result:

https://apple.stackexchange.com/questions/ask
   stackexchange.com

https://www.nytimes.com/2019/07/16/science/5g-cellphones-wireless-cancer.html
   nytimes.com

https://nextdoor.com/news_feed/?post=117602&ct=-A17-ghvVOF0tfn9vptW_5a7JOBEyP4w6_hJAZUnMQqN56952&ec=OWKiQRDj9vEHefhwfGYAE0s%3D&lc=1002&is=tpe
   nextdoor.com

https://www.amazon.com/gp/product/B007B60SCG/ref=ox_sc_act_title_1?smid=ATVPDKIKX0DER&psc=1
   amazon.com

http://www.verizon.net/index.php
   verizon.net

I'm ignoring multi-tier domains (for example, those ending in .co.uk). I'd prefer to use Bash on macOS.

There are lots of pages on getting the full domain name:

Extract domain name from URL using bash shell parameter substitution

https://www.cyberciti.biz/faq/get-extract-domain-name-from-url-in-linux-unix-bash/
echo http://example.com/index.php | awk -F'[/:]' '{print $4}'

https://stackoverflow.com/a/11385736/1360075

I do not need this level of perfection:

https://github.com/john-kurkowski/tldextract

As you are already using awk and are looking for a simple solution:

awk -F/ '{n=split($3, a, "."); printf("%s.%s", a[n-1], a[n])}' <<< 'http://www.example.com/index.php'
      ^ ^   ^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^
      | |          |                                  |
      | |          |                            last two elements 
      | |          |
      | |          +--- Split the 3rd field (aka the part after //) into
      | |               the array 'a', using '.' as the separator for splitting.
      | |               Returns the number of created array elements in 'n'.
      | |
      | +-------------- The awk code between the '' gets run once for every
      |                 input line, with the fields split by -F/ stored in
      |                 $1, $2 etc. In our case $1 contains "http:", $2 is 
      |                 empty, $3 contains "www.example.com" and $4 etc. the
      |                 various path elements (if there are any)
      |
      +---------------- Split the input lines into fields, separated by '/'
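
To try the one-liner against the URLs from the question, it can be wrapped in a small function. This is only a sketch: the name base_domain is made up for illustration, and it assumes every URL has a scheme (so the host is the 3rd '/'-separated field) and carries no port, user info, or multi-part suffix. Unlike the command above, it also prints a trailing newline.

```shell
#!/bin/bash
# Hypothetical helper (not from the answer above): print the last two
# dot-separated labels of the host part of a URL.
# Assumes "scheme://host/..." with no port and no user:password@ prefix.
base_domain() {
    printf '%s\n' "$1" | awk -F/ '{
        n = split($3, a, ".")          # split the host on dots
        if (n >= 2)
            printf("%s.%s\n", a[n-1], a[n])
        else
            print $3                   # host without a dot: print it unchanged
    }'
}

base_domain 'https://apple.stackexchange.com/questions/ask'
base_domain 'https://www.nytimes.com/2019/07/16/science/5g-cellphones-wireless-cancer.html'
base_domain 'http://www.verizon.net/index.php'
```

Passing the URL through printf and a pipe rather than a here-string keeps the function usable from plain sh as well as from the Bash that ships with macOS.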

Parsing URLs with Bash

The following questions should provide a good starting point:

Parse URL in shell script
Parse below URL in bash

@pjz's answer breaks apart a URL into more manageable parts:

#!/bin/sh

INPUT_URL="https://www.amazon.com/gp/product/B007B60SCG/ref=ox_sc_act_title_1?smid=ATVPDKIKX0DER&psc=1"

# extract the protocol (e.g. "https://"); empty if the URL has no "://"
proto="$(echo "$INPUT_URL" | grep '://' | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol prefix (a no-op when $proto is empty)
url="${INPUT_URL#"$proto"}"

# extract the user and password (if any)
userpass="$(echo "$url" | grep @ | cut -d@ -f1)"
pass="$(echo "$userpass" | grep : | cut -d: -f2)"
if [ -n "$pass" ]; then
    user="$(echo "$userpass" | grep : | cut -d: -f1)"
else
    user="$userpass"
fi

# extract the host and port
hostport="$(echo "$url" | sed -e "s,$userpass@,,g" | cut -d/ -f1)"
port="$(echo "$hostport" | grep : | cut -d: -f2)"
if [ -n "$port" ]; then
    host="$(echo "$hostport" | grep : | cut -d: -f1)"
else
    host="$hostport"
fi

# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"

# quoting the variables keeps the shell from globbing on "?" or splitting on spaces
echo "$hostport"

Given $hostport (or $host, which has any port removed), you should now be able to strip the value back to the domain as desired.
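
One possible way to do that last step, using only POSIX parameter expansion, is sketched below. The function name strip_to_domain is hypothetical, and the sketch assumes a plain host name with an optional ":port" (no bracketed IPv6 literal) and a single-label top-level domain, as in the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of @pjz's script): reduce "host[:port]"
# to its last two dot-separated labels.
strip_to_domain() {
    host="${1%%:*}"        # drop ":port" if present
    tld="${host##*.}"      # text after the last dot, e.g. "com"
    rest="${host%.*}"      # text before the last dot, e.g. "www.amazon"
    sld="${rest##*.}"      # last label of the remainder, e.g. "amazon"

    if [ "$host" = "$tld" ]; then
        # no dot in the host (e.g. "localhost"): nothing to strip
        echo "$host"
    else
        echo "$sld.$tld"
    fi
}

strip_to_domain "www.amazon.com"
strip_to_domain "apple.stackexchange.com:8080"
```

Because it spawns no external commands, this runs unchanged under /bin/sh and the Bash shipped with macOS. Hosts under multi-part suffixes would still need a public-suffix-aware tool.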
