wget web crawler retrieves unwanted index.html index files

To exclude the index-sort files, i.e. those with URLs like index.html?C=..., without excluding any other kind of index.html* file, there is indeed a more precise specification possible. Try: -R '\?C='
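
For context, these index.html?C=... links are the column-sort links that Apache directory listings (mod_autoindex) typically generate; C= selects the sort column and O= the order. A few typical examples of what the pattern rejects:

index.html?C=N;O=D    (sort by Name, Descending)
index.html?C=M;O=A    (sort by Last modified, Ascending)
index.html?C=S;O=A    (sort by Size, Ascending)
index.html?C=D;O=A    (sort by Description, Ascending)

All of them contain the literal ?C=, which is exactly what -R '\?C=' matches.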

Quick Demo

Set up a different empty directory, for example:

$ mkdir ~/experiment2
$ cd ~/experiment2

Then run a shorter version of your command, without the recursion and levels, for a quick one-page test:

$ wget --tries=inf --timestamping --convert-links --page-requisites --no-parent -R '\?C=' http://ioccc.org/2013/cable3/

After wget is done, ~/experiment2 will have no index.html?C=... files:

.
└── ioccc.org
    ├── 2013
    │   └── cable3
    │       └── index.html
    ├── icons
    │   ├── back.gif
    │   ├── blank.gif
    │   ├── image2.gif
    │   ├── text.gif
    │   └── unknown.gif
    └── robots.txt

4 directories, 7 files
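
If you do not have tree installed, the same check can be done with find; this sketch just lists every index.html* file under the current directory, and should show only the plain index.html:

$ find . -type f -name 'index.html*'
./ioccc.org/2013/cable3/index.html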

So it has indeed excluded those redundant index-sort index.html?C=... files while keeping all other index.html* files, in this case just index.html.

Implement

So just implement the -R '\?C=' by updating your shell function in ~/.bashrc:

crwl() {
  wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent -R '\?C=' "$@"
}

Then remember to either test in a new terminal, or re-source your ~/.bashrc to make it effective:

$ . ~/.bashrc
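
To confirm your shell has picked up the updated function, bash's type builtin will show the definition it will actually run, which should print something like:

$ type crwl
crwl is a function
crwl ()
{
    wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent -R '\?C=' "$@"
}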

Then try it in a new directory, for comparison:

$ mkdir ~/experiment3
$ cd ~/experiment3
$ crwl http://ioccc.org/2013/cable3/

Warranty

  • wget 1.14 and up only. So if your wget -V says it is 1.13, this may not work, and you may need to actually delete those pesky index.html?C=... files yourself, or try to get a more recent version of wget.
  • works by specifying that you want to -R, or reject, a pattern; in this case pages with the ?C= pattern that is typical of the index.html?C=... versions of index.html.
  • however, ? happens to be a wget wildcard, so to match a literal ? you need to escape it as \? (a regex-based alternative is sketched after this list).
  • don't interrupt wget. The way wget seems to handle browsable web pages is download first, delete later, as if it needs to check whether those pages contain further links to crawl. So if you cancel halfway, you will still end up with index.html?C=... files. Only if you let wget finish will it apply your -R specification and delete any temporarily downloaded index.html?C=... files for you.
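
As an aside, wget 1.14 also added --accept-regex/--reject-regex, which are matched against the complete URL rather than the file name. The following sketch should therefore reject the same sort pages (same options as the demo command, only the reject flag differs):

$ wget --tries=inf --timestamping --convert-links --page-requisites --no-parent --reject-regex '\?C=' http://ioccc.org/2013/cable3/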

Try this after the download if you do not want to rely on wget's removal mechanism, or are on a system not supporting this option.

WHICH=which                        # $WHICH was undefined here originally; define it first
FIND=$($WHICH find)
PWD2=$($WHICH pwd)
SH=$($WHICH sh)
ECHO=$($WHICH echo); export ECHO   # exported so it is visible inside the sh -c command below
LESS=$($WHICH less)

Command:

$FIND "$($PWD2)" -regextype posix-egrep -type f -regex '^(.*?html\?C=[DNSM];O=[AD])$' -exec "$SH" -c 'o="{}";$ECHO -f -v "${o}"' \; | $LESS

When you are satisfied with the output, do the following:

  1. Issue the command in the box below (it defines and exports $RM).
  2. Replace $ECHO with $RM in the above command.
  3. Remove the pipe (|) and the $LESS, so the deletions run and rm's -v output prints each removed file.

(I'm not responsible if you delete your whole file system; hence this cautious two-step approach.)

RM=$($WHICH rm); export RM
$FIND "$($PWD2)" -regextype posix-egrep -type f -regex '^.*html\?C=[DNSM];O=[AD]$' -exec "$SH" -c 'o="$1"; $RM -f -v "${o}"' "$SH" {} \;

Hope this helps.