wget recursive download, but I don't want to follow all links

You might also try HTTrack, which has (IMO) more flexible and intuitive include/exclude logic. Something like this:

httrack "https://example.com" -O ExampleMirrorDirectory \
"-*" \
"+https://example.com/images/*" \
"-*.swf"

The rules are applied in order, and later rules override earlier ones:

  1. Exclude everything
  2. But include https://example.com/images/*
  3. But exclude anything ending in .swf

It looks like this isn't possible in wget.


Under the --reject section of 'man wget':

"Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix."

If you are doing this, it would help to show examples of the patterns you are using, along with what you expect them to match and what they actually match. You say they are matching, but are you sure?

Also, make sure you put this list in quotes, so the shell doesn't expand those wildcards before passing the argument(s) to wget.
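
A quick illustration of that pitfall (the filenames here are hypothetical; any directory containing matching files will do):

```shell
# In a directory that happens to contain matching files, an unquoted
# wildcard is expanded by the shell before wget ever sees it.
cd "$(mktemp -d)"
touch a.jpg b.jpg

echo --reject *.jpg     # shell expands the glob: prints "--reject a.jpg b.jpg"
echo --reject "*.jpg"   # quoted: the literal pattern reaches the program
```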

Even if your system doesn't have version 1.12, read the "Types of Files" section of the manual here. According to the change log, the maintainer added some caveats:

* NEWS: Added documentation change re: --no-parents, and various
caveats on accept/reject lists behavior. Rearranged some items in
order of priority.

You could restrict the depth of recursion with the -l NUMBER option, if that helps (it limits depth rather than excluding URLs by pattern).

A level of 2 downloads index.html, the pages/images it links to, and the links on those pages.
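
Putting that together, a sketch only (example.com stands in for the real site, and the reject list is illustrative):

```shell
# Recurse at most two levels deep and skip some common media suffixes.
# "example.com" and the suffix list are placeholders - adjust to taste.
wget -r -l 2 --reject=gif,jpg,swf "https://example.com/"
```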


How are you invoking wget? Try it this way:

wget -r --reject=gif,jpg,swf http://norc.aut.ac.ir/

This command ignores gif, jpg, and swf files.