How to download a website from the archive.org Wayback Machine?

Solution 1:

I tried several ways to download a site and finally found the Wayback Machine Downloader, which was built by Hartator (so all credit goes to him). I simply had not noticed his comment on the question. To save you time, I am adding the wayback_machine_downloader gem as a separate answer here.

The site at http://www.archiveteam.org/index.php?title=Restoring lists these ways to download from archive.org:

  • Wayback Machine Downloader, a small tool in Ruby to download any website from the Wayback Machine. Free and open-source. My choice!
  • Warrick - Main site seems down.
  • Wayback downloader, a service that will download your site from the Wayback Machine and even add a plugin for WordPress. Not free.

Solution 2:

This can be done using a bash shell script combined with wget.

The idea is to use some of the URL features of the Wayback Machine:

  • http://web.archive.org/web/*/http://domain/* will list all saved pages from http://domain/ recursively. It can be used to build an index of pages to download, which avoids relying on heuristics to detect links in web pages. For each link, the dates of the first and last versions are also given.
  • http://web.archive.org/web/YYYYMMDDhhmmss*/http://domain/page will list all versions of http://domain/page for year YYYY. Within that page, links to specific versions (with exact timestamps) can be found.
  • http://web.archive.org/web/YYYYMMDDhhmmssid_/http://domain/page will return the unmodified page http://domain/page at the given timestamp. Notice the id_ token.

These are the basics to build a script to download everything from a given domain.
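
As a rough starting point, the three URL patterns above could be wrapped in small shell helpers. This is only a sketch: the function names (wayback_index_url, wayback_versions_url, wayback_raw_url) are made up for illustration, and example.com stands in for the real domain.

```shell
#!/bin/sh
# Hypothetical helpers that only build Wayback Machine URLs;
# the function names are illustrative and not part of any existing tool.

# Index of all saved pages under a site, e.g. http://example.com
wayback_index_url() {
    printf 'http://web.archive.org/web/*/%s/*\n' "$1"
}

# Listing of the versions of one page for a timestamp prefix
# (a year such as 2015, or a longer YYYYMMDDhhmmss prefix).
wayback_versions_url() {
    printf 'http://web.archive.org/web/%s*/%s\n' "$1" "$2"
}

# Unmodified copy of one page at an exact timestamp (note the id_ token).
wayback_raw_url() {
    printf 'http://web.archive.org/web/%sid_/%s\n' "$1" "$2"
}

# Possible use with wget (not run here, it needs network access):
#   wget -O page.html "$(wayback_raw_url 20150415082949 http://example.com/page)"
```

A full script would still have to fetch the index, extract the page list and timestamps from it, and loop over them. That part depends on how the index page is rendered and is not shown here.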

Solution 3:

You can do this easily with wget.

wget -rc --accept-regex '.*ROOT.*' START

Where ROOT is the root URL of the website and START is the starting URL. For example:

wget -rc --accept-regex '.*http://www.math.niu.edu/~rusin/known-math/.*' http://web.archive.org/web/20150415082949fw_/http://www.math.niu.edu/~rusin/known-math/

Note that the START URL should bypass the Web archive's wrapping frame. In most browsers, you can right-click on the page and select "Show Only This Frame" to get the URL of the framed page.
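
The command above passes ROOT to --accept-regex as is, so characters such as "." act as regex wildcards. That usually works, but a stricter variant might escape them first. The following is a sketch under that assumption: escape_regex and mirror_capture are hypothetical helper names, and only the metacharacters most likely to appear in a URL are handled.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative only.

# Backslash-escape the regex metacharacters most likely to occur in a URL
# (? + * .) so that ROOT is matched literally.
escape_regex() {
    printf '%s\n' "$1" | sed 's/[?+*.]/\\&/g'
}

# Same wget invocation as above, with ROOT escaped before it is
# embedded in the accept pattern.
#   $1 = ROOT (root URL of the original site)
#   $2 = START (archived URL to start from)
mirror_capture() {
    root=$1
    start=$2
    wget -rc --accept-regex ".*$(escape_regex "$root").*" "$start"
}

# Possible use (not run here, it needs network access):
#   mirror_capture 'http://www.math.niu.edu/~rusin/known-math/' \
#       'http://web.archive.org/web/20150415082949fw_/http://www.math.niu.edu/~rusin/known-math/'
```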

Solution 4:

There is a tool specifically designed for this purpose, Warrick: https://code.google.com/p/warrick/

It's based on the Memento protocol.