How to crawl using wget to download ONLY HTML files (ignore images, css, js)

Solution 1:

@ernie's comment about --ignore-tags lead me down the right path! When I looked up --ignore-tags in man, I noticed --follow-tags.

Setting --follow-tags=a allowed me to skip img, link, script, etc.

It's probably too limited for some people looking for the same answer, but it actually works well in my case (it's okay if I miss a couple pages).

If anyone finds a way to allow for scanning ALL tags, but prevents wget from rejecting files only after they're downloaded (they should reject based on filename or header Content-type before downloading), I will very happily accept their answer!

Solution 2:

what about adding the options:

--reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso'
--ignore-tags=img,link,script 
--header="Accept: text/html"

How to crawl using wget to download ONLY HTML files (ignore images, css, js)

Solution 1:

Solution 2:

Related

Recent Posts