Using Wget to Recursively Crawl a Site and Download Images
Solution 1:
Why don't you try using wget -A jpg,jpeg -r http://example.com?
Solution 2:
How do you expect wget to know the contents of subpage13.html (and so the JPGs it links to) if it isn't allowed to download it? I suggest you allow HTML, get what you want, then remove what you don't want.
I'm not quite sure why your CGIs are getting rejected... does wget print any errors? Perhaps make wget verbose (-v) and see. That might be best as a separate question.
That said, if you don't care about bandwidth, download lots and remove what you don't want afterwards; it doesn't matter much.
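A rough sketch of that download-then-clean-up approach (example.com, the recursion depth, and the output directory name are placeholders, and GNU find is assumed for the cleanup):

    # Accept the image types plus HTML, so wget can follow links into subpages
    wget -r -l 5 -A jpg,jpeg,html,htm http://example.com/
    # Afterwards, delete the HTML files that were only needed for crawling
    find example.com/ -type f -name '*.htm*' -delete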
Also check out --html-extension
From the man page:
-E
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp .[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig.
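Putting that together, a mirroring run with -E, -k and -K might look something like this (the URL is a placeholder):

    # -E appends .html to CGI/ASP output, -k converts links for local viewing,
    # -K keeps the original version as X.orig, which avoids the re-downloading described above
    wget -r -E -k -K http://example.com/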
--restrict-file-names=unix
might also be useful because of those CGI URLs.
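For example, it can simply be added to a command like the one above (URL again a placeholder):

    wget -r -E -k -K --restrict-file-names=unix http://example.com/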