How to crawl using wget to download ONLY HTML files (ignore images, css, js)
Solution 1:
@ernie's comment about --ignore-tags led me down the right path! When I looked up --ignore-tags in man, I noticed --follow-tags.
Setting --follow-tags=a allowed me to skip img, link, script, etc.
It's probably too limited for some people looking for the same answer, but it actually works well in my case (it's okay if I miss a couple pages).
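For reference, here is a minimal sketch of how the --follow-tags=a approach might be invoked. The function name crawl_links_only, the WGET override (handy for a dry run with echo), and the use of --recursive are my own assumptions, not something stated above, so adjust them to your setup.

```shell
# Sketch: during a recursive crawl, extract links from <a> tags only,
# so references in img, link, script, etc. are not followed.
# crawl_links_only and the WGET override are hypothetical helpers.
crawl_links_only() {
  # $1: start URL, e.g. https://example.com/ (placeholder)
  "${WGET:-wget}" --recursive --follow-tags=a "$1"
}
```

Call it as crawl_links_only https://example.com/, or prefix the call with WGET=echo to print the arguments without touching the network. Note that an a tag pointing straight at an image or PDF would still be followed, since only the tag type is filtered here.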
If anyone finds a way to scan ALL tags while keeping wget from rejecting files only after they have been downloaded (it should reject based on filename or the Content-Type header before downloading), I will very happily accept their answer!
Solution 2:
What about adding these options:
--reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso'
--ignore-tags=img,link,script
--header="Accept: text/html"
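Put together, the three options could look roughly like the sketch below. The function name crawl_html_only, the WGET override for dry runs, and the --recursive flag are assumptions on my part; the option values are copied from the list above. Keep in mind that the Accept header is only a hint, and a server is free to ignore it.

```shell
# Sketch: combine a suffix reject list, tag filtering, and an Accept hint.
# crawl_html_only and the WGET override are hypothetical helpers.
crawl_html_only() {
  # $1: start URL, e.g. https://example.com/ (placeholder)
  "${WGET:-wget}" \
    --recursive \
    --reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
    --ignore-tags=img,link,script \
    --header="Accept: text/html" \
    "$1"
}
```

Running WGET=echo crawl_html_only https://example.com/ prints the assembled arguments, which is a cheap way to check the quoting before starting a real crawl.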