Escaping query strings with wget --mirror
I'm using wget --mirror --html-extension --convert-links to mirror a site, but I end up with lots of filenames in the format post.php?id=#.html. When I try to view these in a browser it fails, because the browser ignores the query string when loading the file. Is there any way to replace the ? character in the filenames with something else?
The answer suggesting --restrict-file-names=windows worked correctly. In conjunction with the flags --convert-links and --adjust-extension/-E (formerly named --html-extension, which also works but is deprecated), it produces a mirror that behaves as expected.
wget --mirror --adjust-extension --convert-links --restrict-file-names=windows http://www.example
Solution 1:
See the --restrict-file-names option. While not exactly intended for this particular purpose, --restrict-file-names=windows will probably help you along:
--restrict-file-names=modes
Change which characters found in remote URLs must be escaped during generation of local filenames. [...]
When "windows" is given, Wget escapes the characters \, |, /, :, ?, ", *, <, >, and the control characters in the ranges 0--31 and 128--159. In addition to this, Wget in Windows mode uses + instead of : to separate host and port in local file names, and uses @ instead of ? to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as www.xemacs.org:4300/search.pl?input=blah in Unix mode would be saved as www.xemacs.org+4300/search.pl@input=blah in Windows mode.
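To make the mapping in the manual excerpt concrete, here is a minimal sketch that mimics only the two separator substitutions it describes (: to + for host and port, ? to @ for the query). The function name windows_mode_name is made up for illustration, and it does not reproduce wget's escaping of the other listed characters.

```shell
# Illustration only: imitate the two separator swaps described in the manual.
# windows_mode_name is a hypothetical helper, not part of wget.
windows_mode_name() {
    # Replace the first ":" (host/port separator) with "+"
    # and the first "?" (query separator) with "@".
    printf '%s\n' "$1" | sed -e 's/:/+/' -e 's/?/@/'
}

windows_mode_name 'www.xemacs.org:4300/search.pl?input=blah'
```

The input and the resulting name are the same pair the manual uses as its example.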
Solution 2:
Your browser will view it fine if you use a URL like
file:///tmp/example.com/post.php%3Fid=1.html
instead of
file:///tmp/example.com/post.php?id=1.html
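If you need to build such URLs for many files, a small helper can do the encoding. This is a rough sketch that assumes an absolute local path and encodes only the ? character; file_url is a hypothetical name, and other special characters in filenames are not handled.

```shell
# Sketch: turn an absolute local path into a file:// URL,
# percent-encoding "?" so the browser does not treat it as a query string.
# file_url is a hypothetical helper; only "?" is encoded here.
file_url() {
    printf 'file://%s\n' "$1" | sed 's/?/%3F/g'
}

file_url '/tmp/example.com/post.php?id=1.html'
```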
Note: if you're having trouble with internal links in the downloaded files, it is probably because you terminated wget before it had finished downloading. Since you specified --convert-links together with --html-extension, wget would normally rewrite the links to use %3F instead of ?. However, it does this only at the end, after all downloads have finished. If it is interrupted, none of the links are fixed, and you're left in this predicament. Of course, you can always write a script to go through and fix the links, but...
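For what it's worth, a link-fixing script along those lines might look roughly like the sketch below. It rests on several assumptions that may not match your mirror: links are relative, written as double-quoted href attributes, and each target was saved with .html appended to the full name including the query string. The name fix_links and the example.com directory are placeholders.

```shell
# Sketch of a filter that rewrites relative hrefs containing a query string:
#   href="post.php?id=1"  ->  href="post.php%3Fid=1.html"
# Absolute URLs (anything with ":" before the "?") are left untouched.
# fix_links is a hypothetical helper; it reads HTML on stdin, writes to stdout.
fix_links() {
    sed 's/href="\([^"?#:]*\)?\([^"#]*\)"/href="\1%3F\2.html"/g'
}

printf '%s\n' '<a href="post.php?id=1">x</a>' | fix_links

# Possible usage over a mirror directory (placeholder name example.com):
# find example.com -name '*.html' | while IFS= read -r f; do
#     fix_links < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
# done
```

Try it on a copy of the mirror first, since a regular expression like this can miss or mangle links that do not fit the assumptions above.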