Easiest way to extract the URLs from an HTML page using sed or awk only

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html
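
Both lynx commands print a numbered reference list rather than bare URLs. If only the URLs are wanted, a small awk filter can drop the numbering. The sketch below assumes the listing uses lines of the form "   1. URL"; strip_lynx_numbers is a hypothetical helper name, and the input is simulated with printf so that it runs without lynx.

```shell
# strip_lynx_numbers: hypothetical helper, not part of lynx.
# Reads a lynx link listing on stdin and prints only the URLs.
strip_lynx_numbers() {
  # Keep lines of the form "   12. URL" and print the second field.
  awk '/^ *[0-9]+\. / { print $2 }'
}

# Simulated lynx output, so the sketch runs without lynx installed.
printf '%s\n' 'References' '' '   1. http://example.com/' '   2. http://example.com/about' |
  strip_lynx_numbers
```

In real use it would sit at the end of either command, for example: lynx -dump -listonly my.html | strip_lynx_numbers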

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
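
Since the question asks for sed or awk only, here is a rough awk-only sketch of the same idea. It is just as crude: it matches any quoted href attribute (not only those inside a tags) and does not check that the opening and closing quotes are the same character. The function name extract_hrefs is made up for illustration.

```shell
# extract_hrefs: hypothetical helper name, not from the answers above.
# Reads HTML on stdin and prints every quoted href value, one per line.
extract_hrefs() {
  # q carries a single quote into awk without breaking the shell quoting.
  awk -v q="'" '
  BEGIN {
    # href= followed by a quote, any non-quote characters, and a quote.
    re = "[Hh][Rr][Ee][Ff]=[\"" q "][^\"" q "]*[\"" q "]"
  }
  {
    line = $0
    # Consume the line match by match so several links per line are found.
    while (match(line, re)) {
      attr = substr(line, RSTART, RLENGTH)
      # Skip href= and the opening quote (6 characters), drop the closing quote.
      print substr(attr, 7, length(attr) - 7)
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}

# Example: two links on one line, one with href as the second attribute.
printf '%s\n' '<a href="/one">1</a> <a class="x" href="/two">2</a>' | extract_hrefs
```

Because it reads stdin, it can replace the grep and sed stages of the wget pipeline: wget -q -O - http://stackoverflow.com | extract_hrefs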


With Xidel, an HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/

grep "<a href=" sourcepage.html |
  sed "s/<a href/\\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq

  1. The first grep keeps only the lines that contain an <a href= tag. You can extend the pattern if you want only local pages, i.e. relative paths rather than http URLs.
  2. The first sed inserts a newline (\n) in front of each <a href tag.
  3. The second sed cuts each line after the 2nd " by replacing that quote with "></a> followed by a newline. After both seds, each URL sits on its own line, but there is leftover garbage, so
  4. the second grep href cleans the mess up.
  5. The sort and uniq leave one instance of each URL present in sourcepage.html.
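
The five steps can be sketched as a function and tried on a small sample file. This version assumes GNU sed (which accepts \n in the replacement text) and double-quoted href values. It also adds one extra sed stage, not in the original answer, that strips the remaining <a href="..."></a> wrapper so only the URL is printed. The name list_anchor_hrefs and the sample HTML are made up for illustration.

```shell
# list_anchor_hrefs: hypothetical wrapper around the pipeline above.
# Assumes GNU sed, which accepts \n in the replacement text.
list_anchor_hrefs() {
  grep '<a href=' "$1" |                    # step 1: lines containing anchors
    sed 's/<a href/\n<a href/g' |           # step 2: one anchor per line
    sed 's/"/"><\/a>\n/2' |                 # step 3: cut after the closing quote
    grep href |                             # step 4: drop the leftover fragments
    sed 's/^<a href="\([^"]*\)".*/\1/' |    # extra step: keep only the URL
    sort -u                                 # step 5: one instance of each URL
}

# Sample page with a duplicated link.
tmp=$(mktemp)
printf '%s\n' '<p><a href="/a">A</a> and <a href="/b">B</a> and <a href="/a">A again</a></p>' > "$tmp"
list_anchor_hrefs "$tmp"   # prints /a and /b, one per line
rm -f "$tmp"
```

Without the extra sed stage, the output lines would be of the form <a href="/a"></a> rather than the bare URL.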

I made a few changes to Greg Bacon's solution:

grep -o '<a .*href=.*>' index.html | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. It also matches anchors in which href is not the first attribute.
  2. It handles several anchors on the same line.
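
To check both cases, the pipeline can be wrapped in a function that reads stdin and fed a single line with two anchors, the first of which has class before href. This is a sketch assuming GNU grep (-o) and GNU sed (\n in the replacement text); hrefs_any_position is a made-up name.

```shell
# hrefs_any_position: hypothetical name for the modified pipeline above.
# Assumes GNU grep (-o) and GNU sed (\n in the replacement text).
hrefs_any_position() {
  grep -o '<a .*href=.*>' |
    sed -e 's/<a /\n<a /g' |
    sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
}

# Two anchors on one line; the first has class before href.
printf '%s\n' '<a class="nav" href="/one">1</a><a href="/two">2</a>' | hrefs_any_position
```

For this input it prints /one and /two on separate lines.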