How can I make wget rename downloaded files to not include the query string?
I'm downloading a site with wget and a lot of the links have queries attached to them, so when I do this:
wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/
I end up with a lot of files like this:
1.mp3?foo=bar
2.mp3?blatz=pow
3.mp3?fizz=buzz
What I'd like to end up with is:
1.mp3
2.mp3
3.mp3
This is all taking place on Ubuntu Linux with wget 1.10.2.
I know I can run a script afterwards to rename everything. However, I'd really like a solution from within wget so I can see the correct names as the download is happening.
Can anyone help me unravel this?
Solution 1:
If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:
wget --content-disposition
You'll need wget 1.11 or later to use this feature; the 1.10.2 mentioned in the question predates it.
I have no idea how well it handles a server claiming a filename of '/etc/passwd'.
Solution 2:
I realized after processing a large batch that I should have instructed wget to ignore the query strings. I did not want to do it over again, so I made this script, which worked for me:
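#!/bin/bash
# Strip the query string (everything from the first "?") from each file name.
# Usage: rmqstr [directory]   (defaults to the current directory)
find "${1:-.}" -type f -name '*\?*' | while IFS= read -r i
do
    mv -- "$i" "${i%%\?*}"
done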
Put that in a file like rmqstr and chmod +x rmqstr.
Syntax: ./rmqstr <directory (defaults to .)>
It will remove the query strings from all filenames recursively.
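Before renaming in bulk, it may be worth previewing what would happen, since two downloads such as 1.mp3?foo=bar and 1.mp3?fizz=buzz would map to the same name. The following is only a sketch; preview_renames is a hypothetical helper name, not part of the script above, and it changes nothing on disk.

```shell
#!/bin/sh
# Hypothetical dry-run helper: list the renames that stripping the query
# string would perform under a directory (default: .), without moving anything.
preview_renames() {
    find "${1:-.}" -type f -name '*\?*' | while IFS= read -r f; do
        target=${f%%\?*}            # drop everything from the first "?"
        if [ -e "$target" ]; then
            # A file with the clean name is already there; mv would overwrite it.
            echo "SKIP $f (target $target exists)"
        else
            echo "mv $f -> $target"
        fi
    done
}
```

Calling preview_renames with the download directory prints one line per affected file; anything marked SKIP needs a decision before the real rename is run.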
Solution 3:
I think, in order to get wget to save as a filename different than the URL specifies, you need to use the -O filename argument. That only does what you want when you give it a single URL -- with multiple URLs, all downloaded content ends up in filename.
But that's really the answer. Instead of trying to do it all in one wget command, use multiple commands. Now your workflow becomes:
- Run wget to get the base HTML file(s) containing your links;
- Parse for URLs;
- Foreach URL ending in mp3,
  - process URL to get a filename (eg turn http://foo/bar/baz.mp3?gargle=blaster into baz.mp3)
  - (optional) check that filename doesn't exist
  - run wget <URL> -O <filename>
That solves your problem, but now you need to figure out how to grab the base files to find your mp3 URLs.
Do you have a particular site/base URL in mind? Steps 1 and 3 will be easier to handle with a concrete example.
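The per-URL part of that workflow could be sketched roughly as follows. This assumes the mp3 URLs have already been extracted from the base HTML; url_to_filename and fetch_one are hypothetical helper names made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn a URL into a local file name by dropping the
# query string and then everything up to the last slash.
url_to_filename() {
    name=${1%%\?*}                  # strip "?gargle=blaster"
    printf '%s\n' "${name##*/}"     # keep only the last path component
}

# Hypothetical helper: download one URL under its cleaned-up name,
# unless a file with that name already exists.
fetch_one() {
    url=$1
    file=$(url_to_filename "$url")
    if [ -e "$file" ]; then
        echo "skipping $file (already exists)"
    else
        wget -nv -O "$file" "$url"
    fi
}
```

With the URLs collected one per line in a file (say urls.txt, also an assumption), a loop like while IFS= read -r u; do fetch_one "$u"; done < urls.txt would cover the "foreach" step. Because each wget call gets exactly one URL, -O behaves the way you want.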
Solution 4:
so I can see the correct names as the download is happening.
OK. Use wget as you normally do; use the post-wget script that you normally use, but process wget's output so that it's easier on the eyes:
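#! /bin/sh
# Run wget with the given arguments and highlight the file name in each "Saving" line.
wget --progress=bar:force "$@" 2>&1 | \
perl -pe 'BEGIN { $| = 1 } s,(?<=`)([^\x27?]+),\e[36;1m$1\e[0m, if /^Saving/'
cgi-cut # rename files (stands in for your usual post-wget rename script)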
This will still show the ?foo=bar as you download, but will display the rest of the name in bright cyan.