Trouble using wget or httrack to mirror archived website
While helpful, prior responses fail to concisely, reliably, and repeatably solve the underlying question. In this post, we briefly detail the difficulties with each and then offer a modest httrack-based solution.
Background
Before we get to that, however, consider perusing mpy's well-written response. In their sadly neglected post, mpy rigorously documents the Wayback Machine's obscure (and honestly obfuscatory) archival scheme.
Unsurprisingly, it ain't pretty. Rather than sanely archiving sites into a single directory, The Wayback Machine ephemerally spreads a single site across two or more numerically identified sibling directories. To say that this complicates mirroring would be a substantial understatement.
Understanding the horrible pitfalls presented by this scheme is core to understanding the inadequacy of prior solutions. Let's get on with it, shall we?
Prior Solution 1: wget
The related StackOverflow question "Recover old website off waybackmachine" is probably the worst offender in this regard, recommending wget for Wayback mirroring. Naturally, that recommendation is fundamentally unsound.
In the absence of complex external URL rewriting (e.g., Privoxy), wget cannot be used to reliably mirror Wayback-archived sites. As mpy details under "Problem 2 + Solution," whatever mirroring tool you choose must allow you to non-transitively download only URLs belonging to the target site. By default, most mirroring tools transitively download all URLs belonging to both the target site and sites linked to from that site – which, in the worst case, means "the entire Internet."
A concrete example is in order. When mirroring the example domain kearescue.com, your mirroring tool must:
- Include all URLs matching https://web.archive.org/web/*/http://kearescue.com. These are assets provided by the target site (e.g., https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js).
- Exclude all other URLs. These are assets provided by other sites merely linked to from the target site (e.g., https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js).
Failing to exclude such URLs typically pulls in all or most of the Internet archived at the time the site was archived, especially for sites embedding externally-hosted assets (e.g., YouTube videos).
That would be bad. While wget does provide a command-line --exclude-directories option accepting one or more patterns matching URLs to be excluded, these are not general-purpose regular expressions; they're simplistic globs whose * syntax matches zero or more characters excluding /. Since the URLs to be excluded contain arbitrarily many / characters, wget cannot be used to exclude these URLs and hence cannot be used to mirror Wayback-archived sites. Period. End of unfortunate story.
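For illustration only, a naive attempt might look like the following sketch (hypothetical command and exclusion patterns; per the limitation just described, no set of -X globs can express "everything except */kearescue.com/*", so the crawl still wanders into other archived sites):
# Hypothetical attempt. The --exclude-directories globs below cannot
# whitelist the target domain, so wget still follows links into other
# archived sites embedded in or linked from kearescue.com.
wget --mirror --page-requisites --convert-links -e robots=off \
     --exclude-directories='/web/*facebook*,/web/*youtube*' \
     'https://web.archive.org/web/20140517175612/http://kearescue.com'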
This issue has been on public record since at least 2009. It has yet to be resolved. Next!
Prior Solution 2: Scrapbook
Prinz recommends ScrapBook, a Firefox plugin. A Firefox plugin.
That was probably all you needed to know. While ScrapBook's Filter by String... functionality does address the aforementioned "Problem 2 + Solution," it does not address the subsequent "Problem 3 + Solution" – namely, the problem of extraneous duplicates.
It's questionable whether ScrapBook even adequately addresses the former problem. As mpy admits:
Although Scrapbook failed so far to grab the site completely...
Unreliable and overly simplistic solutions are non-solutions. Next!
Prior Solution 3: wget + Privoxy
mpy then provides a robust solution leveraging both wget and Privoxy. While wget is reasonably simple to configure, Privoxy is anything but reasonable. Or simple.
Due to the imponderable technical hurdle of properly installing, configuring, and using Privoxy, we have yet to confirm mpy's solution. It should work in a scalable, robust manner. Given the barriers to entry, this solution is probably more appropriate to large-scale automation than to the average webmaster attempting to recover small- to medium-scale sites.
Is wget + Privoxy worth a look? Absolutely. But most superusers might be better served by simpler, more readily applicable solutions.
New Solution: httrack
Enter httrack, a command-line utility implementing a superset of wget's mirroring functionality. httrack supports both pattern-based URL exclusion and simplistic site restructuring. The former solves mpy's "Problem 2 + Solution"; the latter, "Problem 3 + Solution."
In the abstract example below, replace:
- ${wayback_url} by the URL of the top-level directory archiving the entirety of your target site (e.g., 'https://web.archive.org/web/20140517175612/http://kearescue.com').
- ${domain_name} by the same domain name present in ${wayback_url}, excluding the prefixing http:// (e.g., 'kearescue.com').
Here we go. Install httrack, open a terminal window, cd to the local directory you'd like your site to be downloaded to, and run the following command:
httrack \
  ${wayback_url} \
  '-*' \
  '+*/${domain_name}/*' \
  -N1005 \
  --advanced-progressinfo \
  --can-go-up-and-down \
  --display \
  --keep-alive \
  --mirror \
  --robots=0 \
  --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' \
  --verbose
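As a concrete illustration (untested here), plugging in the kearescue.com example values from above yields something along these lines, with the progress-display options omitted for brevity:
httrack \
  'https://web.archive.org/web/20140517175612/http://kearescue.com' \
  '-*' \
  '+*/kearescue.com/*' \
  -N1005 \
  --can-go-up-and-down \
  --keep-alive \
  --mirror \
  --robots=0 \
  --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' \
  --verbose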
On completion, the current directory should contain one subdirectory for each filetype mirrored from that URL. This usually includes at least:
- css, containing all mirrored CSS stylesheets.
- html, containing all mirrored HTML pages.
- js, containing all mirrored JavaScript.
- ico, containing one mirrored favicon.
Since httrack internally rewrites all downloaded content to reflect this structure, your site should now be browsable as is without modification. If you prematurely halted the above command and would like to continue downloading, append the --continue option to the exact same command and retry.
That's it. No external contortions, error-prone URL rewriting, or rule-based proxy servers required.
Enjoy, fellow superusers.
Unfortunately none of the answers were able to solve the problem of making a complete mirror from an archived website (without duplicating every file dozens of times). So I hacked together another approach. Hacked is the important word, as my solution is neither a general solution nor a very simple (read: copy & paste) one. I used the Privoxy proxy server to rewrite the files on-the-fly while mirroring with wget.
But first, what is so difficult about mirroring from the Wayback Machine?
Problem 1 + Solution
The Wayback toolbar is handy for interactive use, but might interfere with wget. So get rid of it with a Privoxy filter rule:
FILTER: removewaybacktoolbar remove Wayback toolbar
s|BEGIN WAYBACK TOOLBAR INSERT.*END WAYBACK TOOLBAR INSERT|Wayback Toolbar removed|s
Problem 2 + Solution
I wanted to capture the whole site, so I needed a not-too-small recursion depth. But I don't want wget to crawl the whole server. Usually you would use the no-parent option -np of wget for that purpose. But that will not work here, because you want to get
http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/struk/hcp.html
but also
http://web.archive.org/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html
(notice the changed timestamp in the paths). Omitting -np will end up with wget crawling up to (...)http://cst-www.nrl.navy.mil, and finally retrieving the whole navy.mil site. I definitely don't want that! So this filter tries to emulate the -np behavior with the Wayback machine:
FILTER: blocknonparentpages emulate wget -np option
s|/web/([0-9].*)/http://cst-www.nrl.navy.mil/lattice/|THIS_IS_A_GOOD_$1_ADDRESS|gU
s|/web/(.*)/http(.*)([" ])|http://some.local.server/404$3|gU
s|THIS_IS_A_GOOD_(.*)_ADDRESS|/web/$1/http://cst-www.nrl.navy.mil/lattice/|gU
I'll leave it as an exercise to dig into the syntax. What this filter does is the following: it replaces all Wayback URLs like http://web.archive.org/web/20110801041529/http://www.nrl.navy.mil/ with http://some.local.server/404, as long as they do not contain http://cst-www.nrl.navy.mil/lattice/.
You have to adjust http://some.local.server/404. This is to send a 404 error to wget. Probably Privoxy can do that more elegantly. However, the easiest way for me was just to rewrite the link to a non-existent page on a local HTTP server, so I stuck with this.
And you also need to adjust both occurrences of http://cst-www.nrl.navy.mil/lattice/ to reflect the site you want to mirror.
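As a sketch, adapting the same rule to the kearescue.com example from earlier in this thread would look roughly like this (only the two occurrences of the domain change; the 404 trick stays the same). Here it is appended to Privoxy's user.filter via a shell heredoc:
# Hypothetical adaptation of blocknonparentpages for kearescue.com.
cat >> user.filter <<'EOF'
FILTER: blocknonparentpages emulate wget -np option
s|/web/([0-9].*)/http://kearescue.com/|THIS_IS_A_GOOD_$1_ADDRESS|gU
s|/web/(.*)/http(.*)([" ])|http://some.local.server/404$3|gU
s|THIS_IS_A_GOOD_(.*)_ADDRESS|/web/$1/http://kearescue.com/|gU
EOF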
Problem 3 + Solution
And finally, some archived version of a page might link to a page in another snapshot. And that to yet another one. And so on... and you'll end up with a lot of snapshots of the same page -- and wget will never manage to finish until it has fetched all snapshots. I really don't want that either! Here it helps a lot that the Wayback machine is very smart. You can request a file
http://web.archive.org/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html
even if it's not included in the 20110801041529 snapshot. It automatically redirects you to the correct one:
http://web.archive.org/web/20110731225728/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html
So, another Privoxy filter to rewrite all snapshots to the most recent one:
FILTER: rewritewaybackstamp rewrite Wayback snapshot date
s|/([0-9]{14})(.{0,3})/|/20120713212803$2/|g
Effectively, every 14-digit number enclosed in /.../ gets replaced with 20120713212803 (adjust that to the most recent snapshot of your desired site). This might be an issue if there are such numbers in the site structure not originating from the Wayback machine. Not perfect, but fine for the Strukturtypen site.
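To see the substitution in isolation, the same rewrite can be replayed with GNU sed, whose extended-regex syntax is close to Privoxy's (apart from \2 versus $2 backreferences); the URL is the example from above:
# A plain /20110801041529/ component becomes /20120713212803/;
# a suffixed one such as /20110801041529js_/ would keep its js_ suffix.
echo '/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html' \
  | sed -E 's|/([0-9]{14})(.{0,3})/|/20120713212803\2/|g'
# -> /web/20120713212803/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html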
The nice thing about this redirect behavior is that wget ignores the new location it is redirected to and saves the file -- in the above example -- as web.archive.org/web/20110801041529/http://cst-www.nrl.navy.mil/lattice/struk/a_f.html.
Using wget to mirror archived site
So, finally, with these Privoxy filters (defined in user.filter) enabled in user.action via
{ +filter{removewaybacktoolbar} +filter{blocknonparentpages} +filter{rewritewaybackstamp} }
web.archive.org
you can use wget as usual. Don't forget to tell wget to use the proxy:
export http_proxy="localhost:8118"
wget -r -p -k -e robots=off http://web.archive.org/web/20120713212803/http://cst-www.nrl.navy.mil/lattice/index.html
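If wget seems to ignore the filters, a quick probe through the proxy can confirm they are active. This is a sketch assuming a stock Privoxy install listening on its default port 8118, with user.filter and user.action in its configuration directory (paths vary by platform); remember to reload Privoxy after editing them:
# Sanity check: with removewaybacktoolbar active, the toolbar markers
# should be gone from pages fetched through the proxy (expect a count of 0).
curl -s -x localhost:8118 \
  'http://web.archive.org/web/20120713212803/http://cst-www.nrl.navy.mil/lattice/index.html' \
  | grep -c 'BEGIN WAYBACK TOOLBAR INSERT'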
I used the options above, but -m should work, too. You'll end up with the folders
20120713212803
20120713212803cs_
20120713212803im_
20120713212803js_
as the Wayback Machine separates images (im_), style sheets (cs_), etc. I merged everything together and used some sed magic to replace the ugly relative links (../../../../20120713212803js_/http:/cst-www.nrl.navy.mil/lattice) accordingly. But this isn't really necessary.
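For the record, that optional merge could be approximated with something like the following sketch (untested; GNU sed assumed, and the timestamp and paths are the ones from this example):
# Run from inside web.archive.org/web/. Merge the asset snapshots into
# the main folder, then point the remaining cs_/im_/js_ links at it.
for suffix in cs_ im_ js_; do
    cp -r "20120713212803${suffix}/." 20120713212803/
done
find 20120713212803 -type f -name '*.html' -exec sed -i \
    's|20120713212803\(cs_\|im_\|js_\)|20120713212803|g' {} +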
wget --page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display it properly are not downloaded. Using -r together with -l can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with "leaf documents" that are missing their requisites.
For instance, say document 1.html contains an "<IMG>" tag referencing 1.gif and an "<A>" tag pointing to external document 2.html. Say that 2.html is similar but that its image is 2.gif and it links to 3.html. Say this continues up to some arbitrarily high number.
-m --mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
Note that Wget will behave as if -r had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://<site>/<document>
So wget -E -H -k -K -p http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice will be your best bet. But I recommend another tool: ScrapBook, a Firefox extension.
ScrapBook
ScrapBook is a Firefox extension, which helps you to save Web pages and easily manage collections. Key features are lightness, speed, accuracy and multi-language support. Major features are:
* Save Web page
* Save snippet of Web page
* Save Web site
* Organize the collection in the same way as Bookmarks
* Full text search and quick filtering search of the collection
* Editing of the collected Web page
* Text/HTML edit feature resembling Opera's Notes
How to mirror a site
Install ScrapBook and restart Firefox.
- Load the page in the browser [the web page to be mirrored]
- Right-click on the page -> Save page as ...
- Select the level from In depth Save and press Save
- Select Restrict to Directory/Domain from Filter
Wait for the mirroring to complete. After mirroring you can access the website offline from the ScrapBook menu.