How to: Download a page from the Wayback Machine over a specified interval

What I mean is downloading each page available from the Wayback Machine over a specified time period, at a specified interval. For example, I want to download the page available for each day of nature.com from January 2012 to December 2012. (Not precisely what I want to do, but it's close enough and makes a good example.)

Unfortunately, wget won't work here because of how the Wayback Machine serves archived pages.

Tools like Wayback Machine downloader only download the most recent version of the page, it seems.

Interacting with the IA API seems like a viable route, but I'm not sure how that would work.

Thanks!


Solution 1:

Wayback URLs are formatted as follows:

http://$BASEURL/$TIMESTAMP/$TARGET

Here BASEURL is usually http://web.archive.org/web (I say "usually" because I am not sure whether it is the only BASEURL)

TARGET is self-explanatory (in your case http://nature.com, or some similar URL)

TIMESTAMP is the time the capture was made, formatted as YYYYmmddHHMMss (in UTC):

  • YYYY: Year
  • mm: Month (2 digit - 01 to 12)
  • dd: Day of month (2 digit - 01 to 31)
  • HH: Hour (2 digit - 00 to 23)
  • MM: Minute (2 digit - 00 to 59)
  • ss: Second (2 digit - 00 to 59)
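Putting the three parts together (the target and timestamp below are illustrative assumptions, not actual captures), a request for the capture closest to noon UTC on 15 June 2012 would look like this:

```shell
# Build a wayback URL from its three parts
# (TARGET and TIMESTAMP here are made-up example values)
BASEURL='http://web.archive.org/web'
TARGET='http://nature.com'
TIMESTAMP='20120615120000'   # 2012-06-15 12:00:00 UTC
URL="$BASEURL/$TIMESTAMP/$TARGET"
echo "$URL"   # → http://web.archive.org/web/20120615120000/http://nature.com
```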

In case you request a capture time that doesn't exist, the wayback machine redirects to the closest capture for that URL, whether in the future or the past.

You can use that feature to collect the daily URLs: issue an HTTP HEAD request for each day with curl -I and read the Location header:

BASEURL='http://web.archive.org/web'
TARGET="SET_THIS"
START=1325419200 # Jan 1 2012 12:00:00 UTC (Noon) 
END=1356998400 # Tue Jan  1 00:00:00 UTC 2013
# GNU and BSD date use different flags to format an epoch timestamp
if [[ "$(uname -s)" == 'Darwin' ]]; then
    epoch_to_ts() { date -u -r "$1" +%Y%m%d%H%M%S; }   # BSD date
else
    epoch_to_ts() { date -u -d "@$1" +%Y%m%d%H%M%S; }  # GNU date
fi

while [[ $START -lt $END ]]; do
    TIMESTAMP=$(epoch_to_ts "$START")
    # -s silent, -I HEAD request; strip the trailing CR from the header value
    REDIRECT="$(curl -sI "$BASEURL/$TIMESTAMP/$TARGET" | awk '/^[Ll]ocation/ {print $2}' | tr -d '\r')"
    if [[ -z "$REDIRECT" ]]; then
        echo "$BASEURL/$TIMESTAMP/$TARGET"
    else
        echo "$REDIRECT"
    fi
    START=$((START + 86400)) # add 24 hours
done

This gets you the URLs closest to noon on each day of 2012. Just remove the duplicates and download the pages.
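A minimal sketch of that last step, assuming the loop's output was redirected to a file named urls.txt (the filename is an assumption):

```shell
# Deduplicate the collected URLs, then fetch each page.
# urls.txt is assumed to hold the loop's output, one URL per line.
sort -u urls.txt > unique_urls.txt
while IFS= read -r url; do
    wget -nc "$url"   # -nc: skip files that were already downloaded
done < unique_urls.txt
```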

Note: The script above could probably be improved to jump forward when REDIRECT points to a URL more than one day in the future, but that requires deconstructing the returned URL and adjusting START to the corresponding date value.
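Deconstructing the redirect is not too bad, since the 14-digit timestamp sits between /web/ and the target. A sketch (the REDIRECT value below is a fabricated example, not a real capture):

```shell
# Extract the 14-digit timestamp from a redirect URL
# (REDIRECT is a made-up example value for illustration)
REDIRECT='http://web.archive.org/web/20120317143000/http://nature.com/'
TS=$(printf '%s' "$REDIRECT" | sed -E 's|^https?://[^/]+/web/([0-9]{14}).*$|\1|')
echo "$TS"   # → 20120317143000
# On Linux, convert it back to epoch seconds so START can be advanced past it:
# START=$(date -u -d "${TS:0:4}-${TS:4:2}-${TS:6:2} ${TS:8:2}:${TS:10:2}:${TS:12:2}" +%s)
```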

Solution 2:

There is a Ruby gem on GitHub: https://github.com/hartator/wayback-machine-downloader