wget web crawler retrieves unwanted index.html index files
To exclude the index-sort files, i.e. those whose URLs contain index.html?C=..., without excluding any other kind of index.html* file, there is indeed a more precise specification possible. Try: -R '\?C='
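As a rough illustration (this is not wget itself, just shell globbing, which treats ? and * much like wget's accept/reject lists do), you can see which filenames the escaped pattern targets; the example names below are typical Apache index-sort links and are only illustrative:
for f in 'index.html' 'index.html?C=N;O=D' 'index.html?C=M;O=A'; do
    case "$f" in
        *\?C=*) echo "rejected: $f" ;;   # the \? matches a literal ? in the name
        *)      echo "kept:     $f" ;;
    esac
done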
Quick Demo
Set up a different empty directory, for example:
$ mkdir ~/experiment2
$ cd ~/experiment2
Then run a shorter version of your command, without the recursion and level options, in order to do a quick one-page test:
$ wget --tries=inf --timestamping --convert-links --page-requisites --no-parent -R '\?C=' http://ioccc.org/2013/cable3/
After wget is done, ~/experiment2 will have no index.html?C=... files:
.
└── ioccc.org
    ├── 2013
    │   └── cable3
    │       └── index.html
    ├── icons
    │   ├── back.gif
    │   ├── blank.gif
    │   ├── image2.gif
    │   ├── text.gif
    │   └── unknown.gif
    └── robots.txt

4 directories, 7 files
So it has indeed excluded those redundant index-sort index.html?C=... files while keeping all other index.html files, in this case just index.html.
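If you want an explicit check (this just assumes a standard find is available), listing anything with a C= query left on disk should print nothing:
$ find ~/experiment2 -name '*C=*'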
Implement
So just implement the -R '\?C=' by updating your shell function in ~/.bashrc:
crwl() {
    wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent -R '\?C=' "$@"
}
Then remember to either test in a new terminal, or re-source your ~/.bashrc to make it effective:
$ . ~/.bashrc
Then try it in a new directory, for comparison:
$ mkdir ~/experiment3
$ cd ~/experiment3
$ crwl http://ioccc.org/2013/cable3/
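Because the function forwards "$@" to wget, you can still append other wget options on a per-call basis; for example, to be gentler on the server (the extra flags here are just an illustration):
$ crwl --wait=1 --random-wait http://ioccc.org/2013/cable3/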
Warranty
- wget 1.14 and up only. So if your wget -V says it is 1.13, this may not work and you will need to actually delete those pesky index.html?C=... files yourself, or try to get a more recent version of wget (see the version check after this list).
- Works by specifying that you want to -R, or reject, a pattern, in this case pages with the ?C= pattern that is typical of the index.html?C=... versions of index.html.
- However, ? happens to be a wget wildcard, so to match a literal ? you need to escape it as \?.
- Don't interrupt wget. It seems the way wget handles browsable web pages is to actually download first and delete later, as if it needs to check whether those pages have further links to crawl. So if you cancel it halfway, you are still going to end up with index.html?C=... files. Only if you let wget finish will it follow your -R specification and delete any temporarily downloaded index.html?C=... files for you.
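If you are not sure which wget you have, the first line of wget -V shows the version, for example:
$ wget -V | head -n 1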
Try this after the download if you do not want to use wget's removal mechanism, or are on a system that does not support this option.
WHICH=$(command -v which)
FIND=$($WHICH find)
PWD2=$($WHICH pwd)
SH=$($WHICH sh)
ECHO=$($WHICH echo);export ECHO
LESS=$($WHICH less)
Command:
$FIND "$($PWD2)" -regextype posix-egrep -type f -regex '^(.*?html\?C=[DNSM];O=[AD])$' -exec "$SH" -c 'o="{}";$ECHO -f -v "${o}"' \; | $LESS
When you are satisfied with the output, do the following:
- Issue the command in the box below (it defines $RM).
- Replace $ECHO with $RM in the above command.
- Remove the pipe (|) and the $LESS, so the rm output goes straight to the terminal.
(I'm not responsible if you delete your whole file system; that is why it is done this preview-first way.)
RM=$($WHICH rm);export RM
$FIND "$($PWD2)" -regextype ... ;$RM -f -v "${xox}"' \;
Hope this helps.