How do you use wget to mirror a site 1 level deep, recovering JS and CSS resources, including CSS images?

Pretend I wanted a simple page copy downloaded to my HD for permanent keeping. I'm not looking for a deep recursive get, just a single page, but I also want any resources loaded by that page to be downloaded.

Example: https://www.tumblr.com/

Expect:

  • The index.html
  • Any loaded images
  • Any loaded JS files
  • Any loaded CSS files
  • Any images loaded in the CSS file
  • Links to the page resources localized to work with the downloaded copies (no web dependency)

I'm interested to know whether you can help me find the best wget syntax, or another tool, that will do this. The tools I have tried usually fail to get the images loaded by CSS, so the page never looks right when loaded locally. Thank you!

Tangent Solution

I found a way to do this using Firefox. The default save is broken, and there is an addon called "Save Complete" which apparently can do a good job with this. However, you can't download it because it says it is not supported in the current Firefox version. The reason is that it was rolled into the addon "Mozilla Archive Format". Install that, and when you use File > "Save Page As..." there is a new option called "Web Page, complete", which is essentially the old addon and fixes the stock implementation Firefox uses (which is terrible). This isn't a wget solution, but it does provide a workable solution.

EDIT: Another ridiculous issue for anyone who might be following this question in the future, trying to do this. To get the addon to work properly you need to go to Tools > Mozilla Archive Format and change the (terrible) default setting of "take a faithful snapshot of the page" to "preserve scripts and source using Save Complete", otherwise the addon will empty all your script files and replace them with the text "/* Script removed by snapshot save */".


Solution 1:

wget -p -k http://ExampleSite.com

The -p will get you all the required elements to view the site correctly (CSS, images, etc.). The -k will convert all links (including those for CSS and images) so you can view the page offline as it appeared online.
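For reference, the same command with the long-form option names (ExampleSite.com is just a placeholder host) reads as follows; if a server delivers pages without an .html extension, adding --adjust-extension (-E) can also help, though it isn't needed here:

# --page-requisites is -p and --convert-links is -k; ExampleSite.com is a placeholder.
wget --page-requisites --convert-links http://ExampleSite.com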

Update: This is specific to your example site, tumblr.com:

wget -H -N -k -p --exclude-domains quantserve.com --no-check-certificate -U "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2" https://www.tumblr.com

The Breakdown:

-H = allows wget to span hosts, i.e. follow links to a foreign host. Required since tumblr does not serve the images on its front page from the same address; they come from secure.assets.tumblr.com (see the note on excluding domains).

-N = will grab only files that are newer than what you currently have, in case you are downloading the same page again over time.

-k = converts the links so you can view the page properly offline.

-p = grabs all required elements to view the page correctly (CSS, images, etc.)

--exclude-domains = since the tumblr.com homepage has a link to quantserve.com, and I'm guessing you don't want that stuff, you need to exclude it from your wget download. Note: this is a pretty important option to use with -H, because if you go to a site that has multiple links to outside hosts (think advertisers and analytics stuff) then you are going to grab that stuff as well!

--no-check-certificate = required since tumblr uses HTTPS.

-U = changes the user agent. Not really necessary in this instance, since tumblr allows the default wget user agent, but I know some sites will block it. I just threw it in here in case you run into any problems on other sites. In the example snippet I gave, it appears as Mozilla Firefox 6.0a2.

Finally, you have the site: https://www.tumblr.com
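Spelled out with the long option names and line continuations, the same command (nothing added or changed) looks like this:

wget \
  --span-hosts \
  --timestamping \
  --convert-links \
  --page-requisites \
  --exclude-domains=quantserve.com \
  --no-check-certificate \
  --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2" \
  https://www.tumblr.com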

Solution 2:

For the specific site you mentioned, and many others coded like it, wget (and curl) just won't work. The issue is that some of the asset links required to render the page in a browser are themselves created through JavaScript. Wget has a feature request pending to run JavaScript:

http://wget.addictivecode.org/FeatureSpecifications/JavaScript

However, until that is complete, sites that build asset links using JavaScript won't be cloneable using wget. The easiest solution is to find a tool that actually builds a DOM and parses JavaScript like a browser engine does (i.e. the Firefox method you mentioned).
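One rough way to approximate that from the command line, purely as a sketch: let a headless browser execute the JavaScript and dump the rendered DOM, then hand the asset URLs it produces to wget. This assumes a Chrome/Chromium build with headless support is installed (invoked as google-chrome below; adjust the binary name for your system):

# Sketch only: assumes a Chrome/Chromium binary with headless support.
# 1. Let a real browser engine run the page's JavaScript and print the final DOM.
google-chrome --headless --dump-dom https://www.tumblr.com/ > rendered.html

# 2. Pull the absolute src/href URLs out of the rendered markup and fetch them with wget.
grep -oE '(src|href)="https?://[^"]+"' rendered.html | cut -d'"' -f2 | sort -u > assets.txt
wget --timestamping --input-file=assets.txt --directory-prefix=tumblr-assets

Note that rendered.html still points at the original URLs; rewriting those links to the local copies would take an extra step, which is where a browser-based saver (like the Firefox addon above) still has the edge.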