Create a complete local copy of a website using Wget

OK, here is what I need:

  • I'm on a Mac (Mac OS X 10.6.8)
  • I want to completely mirror a website on my hard drive (this is what I'm using as a test)
  • I want all images and prerequisites there, so that the website is browsable when offline
  • I want relative links in all pages to be updated accordingly
  • (optional) adding .html extensions to all files would be great, so that they can be easily recognized and opened by a browser

This is what I'm using:

wget --recursive --no-clobber --page-requisites --convert-links --html-extension --domains wikispaces.com http://chessprogramming.wikispaces.com/

The thing is:

  • .css files and images, etc. do not seem to be downloaded - at least, not as far as I've let it run (OK, maybe they would be downloaded if the process completed, so we may skip this one)
  • NO .html extension is being added
  • Links are not converted

So... any ideas?


Solution 1:

First off, this seems to be an OS X-only problem. I can use the above command on Ubuntu 14.04 LTS and it works out of the box! A few suggestions:

.css files and images, etc. do not seem to be downloaded – at least, not as far as I've let it run (OK, maybe they would be downloaded if the process completed, so we may skip this one)

  1. When you say --domains wikispaces.com, you will not be downloading linked CSS files located on other domains. Some of the stylesheets on that website are located on http://c1.wikicdn.com, as the source of index.html suggests.

  2. Some websites do not allow you to access their linked files (referenced images) directly via their URL (see this page); you can only view them through the website. That doesn't seem to be the case here, though.

  3. Wget does not seem to skip comments while parsing the HTML, so it tries to download URLs that appear inside them. I see the following while Wget is running:

    --2016-07-01 04:01:12--  http://chessprogramming.wikispaces.com/%3C%25-%20ws.context.user.imageUrlPrefix%20%25%3Elg.jpg
    Reusing existing connection to chessprogramming.wikispaces.com:80.
    HTTP request sent, awaiting response... 404 Not Found
    2016-07-01 04:01:14 ERROR 404: Not Found.
    

    Opening the link in a browser takes you to a login page. The name of the file suggests that the URL occurred somewhere in the comments. (If the resulting 404s bother you, the sketch after this list shows one way to skip such URLs.)

  4. Many sites do not allow downloads via download managers, so they check which client originated the HTTP request (the browser, or whatever client you used to request a file from their server).

    Use -U somebrowser to fake the client and pretend to be a browser. For example, -U mozilla can be added to tell the server that Mozilla/Firefox is requesting the page (see the sketch after this list for a fuller User-Agent string). This, however, is not the issue here, since I can download the site without this argument.

  5. The download and request rate is important. Servers do not want their performance degraded by robots requesting data from their site. Use the --limit-rate= and --wait= arguments in Wget to limit the download rate and to wait a few seconds between the GET requests for individual files.

    e.g.

    wget -r --wait=5 --limit-rate=100K <other arguments>
    

    to wait 5 seconds between GET requests and limit the download rate to 100 KB/s. Once again, this is not the issue here, because the server did not require me to limit the download rate to fetch the website.
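
Regarding (3): if your Wget is new enough to support --reject-regex (1.14 or later), a minimal sketch for skipping those template-placeholder URLs might look like the following. The regular expression is only a guess based on the 404 shown above, not something I have verified against the whole site:

    # Hedged sketch: skip URLs containing an encoded <%- ... %> template placeholder.
    # The pattern is an assumption based on the 404 above; requires Wget 1.14+.
    wget --recursive --page-requisites --convert-links --html-extension \
         --reject-regex '%3C%25.*%25%3E' \
         http://chessprogramming.wikispaces.com/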

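Regarding (4): -U just sets the User-Agent header, so you can pass any string you like. A minimal sketch, assuming you want to look like a desktop Firefox (the exact version string below is arbitrary):

    # Pretend to be a regular browser; the User-Agent string is only an example.
    wget --recursive --page-requisites --convert-links --html-extension \
         --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:45.0) Gecko/20100101 Firefox/45.0" \
         http://chessprogramming.wikispaces.com/
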
The most likely case here is (1). Loosen the --domains wikispaces.com restriction so that the CDN host is allowed as well (or drop it altogether) and try again. Let's see where we get. You should at least be able to fetch the CSS files.
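
For example, a hedged sketch that lets Wget follow page requisites onto the CDN host mentioned in (1) (-H/--span-hosts allows leaving the start host; the exact domain list is an assumption):

    # Allow requisites hosted on wikicdn.com as well as the wiki itself.
    wget --recursive --no-clobber --page-requisites --convert-links --html-extension \
         --span-hosts --domains wikispaces.com,wikicdn.com \
         http://chessprogramming.wikispaces.com/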

NO .html extension is being added

The .html extension is added when I run the command.

Links are not converted

I am not sure I am totally correct here, but do not expect links to work out of the box when you mirror a site.

When you pass arguments to an HTTP GET request (for example, http://chessprogramming.wikispaces.com/wiki/xmla?v=rss_2_0 has the argument v=rss_2_0), the request is handled by some script running on the server, for example PHP, and the argument(s) determine which version of the page the script returns.

Remember, when you are mirroring a site, especially a wiki that runs on PHP, you cannot exactly mirror it unless you fetch the original PHP scripts. The HTML pages returned by a PHP script are just one face of the page you can expect to see from that script. The actual logic that generates the page lives on the server, and it will only mirror correctly if you fetch the original PHP file, which you cannot do over HTTP. For that you would need FTP access to the server.
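
To illustrate: each distinct query string is just another URL to Wget, so every variant you fetch is saved as its own static snapshot rather than as the script that produced it. The second argument value below is purely hypothetical:

    # Two requests to the same server-side script; each is saved as a separate static file.
    # The v=atom_1_0 value is only a made-up example of a second argument.
    wget --html-extension "http://chessprogramming.wikispaces.com/wiki/xmla?v=rss_2_0"
    wget --html-extension "http://chessprogramming.wikispaces.com/wiki/xmla?v=atom_1_0"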

Hope this helps.

Solution 2:

Option 1 from user612013's answer was certainly the problem in my case. In fact, it just went wrong because I requested https://censoreddomain.com instead of https://www.censoreddomain.com (note the www.). Once I added the www., wget happily scraped the entire site for me. So it is important to exactly match the canonical name of the domain that you are trying to scrape.

Since the mistake was mine, I think this "catch" applies to wget on all platforms, not just OS X.
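
In other words, something like this (using the placeholder domain from above and the flags from the question; the only difference between the two runs is the www. in the start URL):

    # Did not mirror the whole site for me: the start URL is missing the canonical www. host.
    wget --recursive --page-requisites --convert-links --html-extension https://censoreddomain.com

    # Worked: the start URL matches the canonical host exactly.
    wget --recursive --page-requisites --convert-links --html-extension https://www.censoreddomain.com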