How to download with wget without following links with parameters
I'm trying to download two sites for inclusion on a CD:
http://boinc.berkeley.edu/trac/wiki
http://www.boinc-wiki.info
The problem is that both of these are wikis, so when I download with, e.g.:
wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
I get a lot of files, because wget also follows links like ...?action=edit and ...?action=diff&version=...
Does somebody know a way to get around this?
I just want the current pages, without images, and without diffs etc.
P.S.:
wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
This worked for boinc.berkeley.edu, but boinc-wiki.info is still giving me trouble :/
P.P.S:
I got what appear to be the most relevant pages with:
wget -r -k -nv -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
Solution 1:
A newer version of wget (v1.14) solves these problems.
Use the new option --reject-regex=... to handle query strings.
Note that I couldn't find an updated manual that covers the new options, so use the built-in help instead: wget --help > help.txt
Solution 2:
wget --reject-regex '(.*)\?(.*)' http://example.com
(--regex-type posix is the default.) According to other comments, this works only with recent versions of wget (>= 1.14).
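To see which URLs a pattern like this would reject before starting a long crawl, you can try it against a few sample URLs. This is only a rough sketch: it assumes grep -E (POSIX extended regex) behaves closely enough to wget's default posix regex type, and the helper name would_reject and the example.com URLs are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read URLs on stdin and print those matching the pattern,
# i.e. the ones we would expect --reject-regex to skip.
# Assumption: grep -E approximates wget's default posix regex matching.
would_reject() {
    grep -E -- "$1"
}

# Made-up sample URLs shaped like the wiki links in the question.
sample_urls='http://example.com/wiki/Main_Page
http://example.com/wiki/Main_Page?action=edit
http://example.com/wiki/Main_Page?action=diff&version=1'

# Same pattern as in the command above: anything containing a literal "?".
printf '%s\n' "$sample_urls" | would_reject '(.*)\?(.*)'
```

Only the two URLs with a query string should be printed; the plain page URL is the one wget would keep.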
Beware that it seems you can use --reject-regex only once per wget call. That is, if you want to reject on several patterns, you have to combine them with | in a single regex:
wget --reject-regex 'expr1|expr2|…' http://example.com
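If you keep the patterns in separate strings, a small helper can join them with | for you. A minimal sketch, assuming a POSIX shell; join_patterns is a hypothetical name, the two patterns are taken from the query strings mentioned in the question, and the wget command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with "|" into one alternation.
# The body runs in a subshell, so the IFS change does not leak out.
join_patterns() (
    IFS='|'
    printf '%s\n' "$*"
)

pattern=$(join_patterns 'action=edit' 'action=diff')

# Print the resulting command instead of running it, so nothing is downloaded.
echo "wget -r -k -np -nv --reject-regex '$pattern' http://example.com/"
```

Once the printed command looks right, run it by hand against the real site.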