How can I make wget rename downloaded files to not include the query string?

I'm downloading a site with wget, and many of the links have query strings attached to them, so when I run this:

wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/

I end up with a lot of files like this:

1.mp3?foo=bar
2.mp3?blatz=pow
3.mp3?fizz=buzz

What I'd like to end up with is:

1.mp3
2.mp3
3.mp3

This is all taking place on Ubuntu Linux, and I've got wget 1.10.2.

I know I could rename everything with a script after the download finishes. However, I'd really like a solution from within wget so I can see the correct names as the download is happening.

Can anyone help me unravel this?

Solution 1:

If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:

wget --content-disposition

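You'll need a newish version of wget to use this feature: as far as I know the option first appeared in wget 1.11, so the 1.10.2 mentioned in the question would need upgrading.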

I have no idea how well it handles a server claiming a filename of '/etc/passwd'.
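If you want to see what name the server is suggesting before trusting it, here is a minimal sketch. The function name cd_filename is made up for illustration, and the parsing is deliberately naive (it only handles a plain filename= parameter). It keeps just the last path component, so a claimed '/etc/passwd' comes out as 'passwd'. I have not checked what wget itself does with such a header.

```shell
#!/bin/bash
# Read HTTP response headers on stdin and print the basename of the
# filename suggested by a Content-Disposition header, if there is one.
# cd_filename is a hypothetical helper, not part of wget.
cd_filename() {
    sed -n 's/^[[:space:]]*[Cc]ontent-[Dd]isposition:.*filename="\{0,1\}\([^";]*\).*/\1/p' |
        head -n 1 |      # only the first matching header
        tr -d '\r' |     # drop the CR from CRLF line endings
        sed 's|.*/||'    # discard any directory part the server supplied
}

# Possible usage (example.com is a placeholder URL); -S prints the
# server's headers on stderr and --spider skips the actual download:
# wget -S --spider 'http://example.com/1.mp3?foo=bar' 2>&1 | cd_filename
```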

Solution 2:

I realized after processing a large batch that I should have instructed wget to ignore the query strings. I did not want to do it over again so I made this script which worked for me:

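#!/bin/bash
# Strip the query string (everything from the first '?') from each filename.
# Only files with a '?' in their name are touched; the directory defaults to .
find "${1:-.}" -type f -name '*[?]*' -print0 |
while IFS= read -r -d '' f; do
    mv -- "$f" "${f%%\?*}"
done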

Put that in a file such as rmqstr and run chmod +x rmqstr.

Syntax: ./rmqstr <directory> (the directory defaults to .)

It will remove the query strings from all filenames recursively.

Solution 3:

I think, in order to get wget to save under a filename different from the one the URL specifies, you need to use the -O filename argument. That only does what you want when you give it a single URL; with multiple URLs, all downloaded content is concatenated into that one file.

But that's really the answer. Instead of trying to do it all in one wget command, use multiple commands. Now your workflow becomes:

Run wget to get the base HTML file(s) containing your links;
Parse for URLs;
For each URL ending in mp3:
1. process the URL to get a filename (e.g. turn http://foo/bar/baz.mp3?gargle=blaster into baz.mp3)
2. (optional) check that the filename doesn't already exist
3. run wget <URL> -O <filename>
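The workflow above could be sketched roughly as follows. This is only an outline under some assumptions: the function names are invented for illustration, the link extraction is a naive grep that only catches absolute http(s) URLs inside href="..." attributes, and relative links or HTML-escaped query strings are not handled.

```shell
#!/bin/bash
# Step 1: derive a local filename from a URL by dropping the query string
# and the leading path (http://foo/bar/baz.mp3?gargle=blaster -> baz.mp3).
url_to_filename() {
    local url=${1%%\?*}          # strip everything from the first '?'
    printf '%s\n' "${url##*/}"   # keep only the last path component
}

# Print absolute mp3 URLs found in href="..." attributes of HTML on stdin.
extract_mp3_urls() {
    grep -Eo 'href="https?://[^"]+\.mp3(\?[^"]*)?"' |
        sed 's/^href="//; s/"$//'
}

# Read URLs on stdin and download each one under its cleaned-up name.
fetch_all() {
    local url name
    while IFS= read -r url; do
        name=$(url_to_filename "$url")
        [ -e "$name" ] && continue    # step 2: skip files we already have
        wget -nv -O "$name" "$url"    # step 3: save under the clean name
    done
}

# Possible usage, with the index URL from the question:
# wget -q -O - http://url.to.old.podcasts.com/ | extract_mp3_urls | fetch_all
```

Because each file gets its own wget call with -O, the clean name is what shows up in wget's output while it downloads.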

That solves your problem, but now you need to figure out how to grab the base files to find your mp3 URLs.

Do you have a particular site/base URL in mind? Steps 1 and 3 will be easier to handle with a concrete example.

Solution 4:

"...so I can see the correct names as the download is happening."

OK. Use wget as you normally do; use the post-wget script that you normally use, but process wget's output so that it's easier on the eyes:

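#! /bin/sh
# Highlight the filename on wget's "Saving to" lines, then rename afterwards.
# "$@" (rather than $*) keeps arguments containing spaces or '?' intact.
exec wget --progress=bar:force "$@" 2>&1 | \
  perl -pe 'BEGIN { $| = 1 } s,(?<=`)([^\x27?]+),\e[36;1m$1\e[0m, if /^Saving/'
cgi-cut # the post-wget script that renames the files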

This will still show the ?foo=bar as you download, but will display the rest of the name in bright cyan.
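If you would rather not see the query string at all, a variation on the same idea is to cut it out of the displayed line instead of colouring around it. This is only a sketch: hide_query is a made-up name, and it assumes the same output format as the script above, i.e. lines beginning with "Saving" where the filename ends with an ASCII apostrophe. Other wget versions or locales may quote filenames differently, in which case the pattern would need adjusting.

```shell
#!/bin/sh
# Filter wget's log on stdin: on lines starting with "Saving", drop
# everything from the first '?' up to the closing apostrophe.
# hide_query is a hypothetical helper name.
hide_query() {
    sed "/^Saving/ s/?[^']*'/'/"
}

# Possible usage; the files on disk still carry the query string and
# need renaming afterwards:
# wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/ 2>&1 | hide_query
```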
