How do I extract all the external links of a web page and save them to a file?
You will need two tools, lynx and awk. Try this:
$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' > links.txt
If you need the lines numbered, pipe the output through nl:
$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' | nl > links.txt
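To see what the awk filter is actually doing, here is a rough Python equivalent of '/http/{print $2}'. The dump text below is a made-up excerpt in the shape lynx -dump usually produces (body text, then a numbered References list), not output captured from a real run, and the function name is just something I picked. Note that any body line that happens to contain "http" is matched too, which is the main weakness of this approach.

```python
def second_fields(dump_text):
    """Mimic awk '/http/{print $2}': for every line that contains
    'http', return its second whitespace-separated field."""
    results = []
    for line in dump_text.splitlines():
        if "http" in line:  # same substring match as awk's /http/
            fields = line.split()
            # awk prints an empty string when there is no second field
            results.append(fields[1] if len(fields) > 1 else "")
    return results


# Hypothetical excerpt of `lynx -dump` output (assumption, not a real capture).
SAMPLE_DUMP = """\
   Search [1]Images [2]Maps
   See http for details

References

   1. http://www.google.com.br/imghp
   2. http://maps.google.com.br/maps
"""

if __name__ == "__main__":
    for field in second_fields(SAMPLE_DUMP):
        print(field)
```

The "See http for details" line slips through the filter along with the two real references, so the result may need a second pass if the page text mentions "http" anywhere.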
Here's an improvement on lelton's answer: you don't need awk at all, because lynx has some useful options of its own.
lynx -listonly -nonumbers -dump http://www.google.com.br
If you want numbers:
lynx -listonly -dump http://www.google.com.br
As discussed in other answers, Lynx is a great option, but there are many others in nearly every programming language and environment.
Another choice is xmllint. Sample usage:
$ curl -sS "https://superuser.com" \
| xmllint --html --xpath '//a[starts-with(@href, "http")]/@href' 2>/dev/null - \
| sed 's/^ href="\|"$//g' \
| tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing
Additionally, Perl offers HTML::Parser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use LWP::Simple;
sub start {
    my $href = shift->{href};
    print "$href\n" if $href && $href =~ /^https?:\/\//;
}
my $url = shift @ARGV or die "No argument URL provided";
my $parser = HTML::Parser->new(api_version => 3, start_h => [\&start, "attr"]);
$parser->report_tags(["a"]);
$parser->parse(get($url) or die "Failed to GET $url");
Sample usage (including writing to file per OP request; usage is the same for any script here with a shebang):
$ ./scrape_links https://superuser.com > links.txt \
    && tail -3 links.txt
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing
Ruby has the nokogiri gem:
#! /usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://superuser.com'))
doc.xpath('//a[starts-with(@href, "http")]/@href').each do |link|
  puts link.content
end
NodeJS has cheerio:
const axios = require("axios");
const cheerio = require("cheerio");
(async () => {
  const $ = cheerio.load((await axios.get("https://superuser.com")).data);
  // Keep only anchors whose href starts with "http", as in the other examples
  $('a[href^="http"]').each((i, e) => console.log($(e).attr("href")));
})();
Python's BeautifulSoup hasn't been shown yet in this thread:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://superuser.com").text, "lxml")
for x in soup.find_all("a", href=True):
    if x["href"].startswith("http"):
        print(x["href"])
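One caveat with the "starts with http" filter used in these examples: it also keeps absolute links that point back to the same site. If "external" is taken to mean "hosted somewhere else", a stricter filter can compare hostnames. Below is a minimal sketch of that idea using only the Python standard library; the names LinkCollector and external_links and the sample HTML are mine, invented for illustration, and the hostname comparison is a simplifying assumption (it treats subdomains such as meta.superuser.com as external).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html, base_url):
    """Return absolute http(s) links whose host differs from base_url's."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).hostname
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.hostname != base_host:
            links.append(absolute)
    return links


# Hypothetical page fragment used only to demonstrate the filter.
SAMPLE_HTML = (
    '<a href="/help">Help</a>'
    '<a href="https://superuser.com/tour">Tour</a>'
    '<a href="https://stackoverflow.com/help/licensing">Licensing</a>'
    '<a href="mailto:someone@example.com">Mail</a>'
)

if __name__ == "__main__":
    for link in external_links(SAMPLE_HTML, "https://superuser.com"):
        print(link)
```

To run it against a live page, the HTML string could come from requests.get(url).text as in the BeautifulSoup example, and the output can be redirected to links.txt the same way as for the other scripts.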