Easiest way to extract the URLs from an HTML page using sed or awk only
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
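If you want just the bare URLs rather than lynx's numbered References list, a small awk filter on top of the dump works too. A rough sketch, assuming the usual "   1. http://..." layout of the dump (which can vary slightly between lynx versions):
lynx -dump -listonly my.html | awk '/^[[:space:]]*[0-9]+\./ { print $2 }' | sort -u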
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
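The nested single/double quoting above is hard to read; the same character class can be kept in a shell variable instead. A minimal sketch of the same pipeline, assuming GNU grep and GNU sed:
# ['"] matches either quote style around the href value
q="['\"]"
wget -O - http://stackoverflow.com |
  grep -Eio "<a href=$q[^\"']*$q" |
  sed -E "s/^<a href=$q//I; s/$q\$//"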
With Xidel, an HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
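Xidel also accepts local files as input, so the same extraction can be run offline; a small sketch, with the file name purely illustrative:
$ xidel --extract "//a/@href" my.html | sort -u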
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
- The first grep looks for lines containing URLs. You can add more patterns after it if you only want local pages, i.e. relative paths with no http.
- The first sed adds a newline in front of each <a href tag, using the \n.
- The second sed shortens each URL after the 2nd " in the line by replacing that quote with a closing </a> tag and a newline. Both seds give you each URL on a single line, but there is still garbage, so
- The 2nd grep href cleans the mess up
- The sort and uniq give you one instance of each URL present in sourcepage.html (an awk-only variant is sketched below)
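Since the question asks for sed or awk only, roughly the same extraction can be done with a single awk program. A sketch, assuming GNU awk (for a regex record separator) and anchors whose href values are double-quoted:
awk -v RS='<a [^>]*href="' -v FS='"' 'NR > 1 { print $1 }' sourcepage.html | sort | uniq
The record separator consumes everything up to and including the opening quote, so the first field of each following record (split on ") is the URL itself.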
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
- We are matching cases where the anchor doesn't start with href as its first attribute
- We are covering the possibility of having several anchors on the same line
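A quick way to see both fixes in action is to feed the command a hand-made line with two anchors, one of which has href as its second attribute (purely hypothetical test input):
echo '<p><a class="x" href="https://a.example/">A</a> <a href='\''https://b.example/'\''>B</a></p>' |
  grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' |
  sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
which should print the two URLs on separate lines.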