wget to clone a website, with links pointing to the directory rather than index.html

I would like to clone a WordPress blog with wget so that I can include it as static content in a web app I am writing.

At the moment I am using the following to clone the site:

wget -rk http://sitename.com

This is working well, but the links in the generated HTML point to the index.html files. I would like those links to point to the directory that contains the file instead.

e.g. for the page http://sitename.com/blog-post-about-cats/ wget generates a directory "blog-post-about-cats" and puts an index.html file in there. Links to that blog post are written as "../blog-post-about-cats/index.html", whereas I want them to be "../blog-post-about-cats/".
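To make it concrete, here is a quick (hypothetical) way to see the converted links; the path comes from the example above and the output shown in the comment is only what I see in spirit, not verbatim:

grep -o 'href="[^"]*/index\.html"' blog-post-about-cats/index.html
# prints lines like: href="../blog-post-about-cats/index.html"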

I mainly want this because the index.html in the URL looks a bit ugly, and these pages are all about presentation.

Any ideas? Is this possible with wget or perhaps a different command line tool?

Thanks.


Solution 1:

I assume wget doesn't do this by default because it can't assume that whatever ends up serving the mirrored files will map a directory URL to index.html, so linking to the file explicitly is the safe choice. The simplest solution is to post-process all the fetched HTML files afterwards with a regular expression:

find . -name '*.html' | xargs sed -ri 's/href="([^"]*)\/index\.html"/href="\1\/"/gi'

If the pages on this site are some other type of file such as .php files, substitute "*.php" or whatever is suitable. The function of the regular expression is to identify strings of the form href="stuff/index.html" and remove the index.html. The xargs and find are used to apply this to all pages, and the "-i" flag to sed makes it modify files in-place. The "gi" flags in the regular expression make it replace all occurrences, and be case-insensitive (since HTML is case-insensitive).