How can I download an entire (active) phpbb forum?

I am doing this right now. Here's the command I'm using:

wget -k -m -E -p -np \
    -R 'memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0*' \
    -o log.txt http://www.example.com/forum/

I wanted to strip out those pesky session IDs (sid=blahblahblah). They seem to get added automatically by the index page and then attach themselves to every link in a virus-like fashion - except for one squirreled away somewhere that points to a plain index.php, which then carries on with no sid= parameter. (Perhaps there's a way to force the recursive wget to start from index.php - I don't know.)
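
You can simply hand wget the index.php URL as its starting point instead of the directory. Whether that actually dodges the sid links depends on the board, but it costs nothing to try - a sketch with that one change (same example host, untested against a live board):

wget -k -m -E -p -np \
    -R 'memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0*' \
    -o log.txt http://www.example.com/forum/index.php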

I have also excluded some other pages that lead to a lot of cruft being saved. In particular, memberlist.php and viewtopic.php with p= specified can each create thousands of files!

Due to this bug in wget (http://savannah.gnu.org/bugs/?20808), it will still download an astounding number of those useless files - especially the viewtopic.php?p= ones - before simply deleting them, so this is going to burn a lot of time and bandwidth.
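
If your wget is 1.14 or newer, --reject-regex sidesteps that bug: it is matched against the complete URL (query string included) before the download happens, unlike -R, which only inspects the filename afterwards. A sketch covering the worst offenders (same example host; the exact regex is my guess and may need adjusting for your board):

wget -k -m -E -p -np \
    --reject-regex '(memberlist|faq|posting|search|ucp|viewonline)\.php|viewtopic\.php.*p=|sid=|view=print|start=0' \
    -o log.txt http://www.example.com/forum/index.php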


I recently faced a similar problem: a phpBB site I frequent was facing imminent extinction (sadly, because the admin had passed away). With over seven years of posts on the forum, I didn't want to see it vanish, so I wrote a Perl script to walk all the topics and save them to disk as flat HTML files. In case anyone else is in the same situation, the script is available here:

https://gist.github.com/2030469

It relies on a regex to extract the number of posts in a topic (needed for pagination), but other than that it should generally work. Some of the regexes may need tweaking depending on your phpBB theme.
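
For anyone who would rather not read Perl, here is a minimal sketch of the same pagination idea in plain shell. It assumes the stock phpBB URL scheme (viewtopic.php?t=TOPIC&start=N); the host, topic id, post count, and posts-per-page below are all placeholders you would fill in (the post count is what the regex step above extracts):

#!/bin/sh
# Walk one topic's pages by stepping the start= offset.
BASE="http://www.example.com/forum"
TOPIC=123      # hypothetical topic id
POSTS=60       # total posts in the topic (scraped from the topic page)
PER_PAGE=25    # phpBB's default posts per page
start=0
while [ "$start" -lt "$POSTS" ]; do
    # Save each page of the topic as a flat HTML file.
    curl -s "$BASE/viewtopic.php?t=$TOPIC&start=$start" -o "topic-$TOPIC-start-$start.html"
    start=$((start + PER_PAGE))
done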


Try some combination of wget flags like:

wget -m -k www.example.org/phpbb

Where -m is "mirror" and -k is "convert links". You may also wish to add -p to download images and other page requisites; -m (shorthand for -r -N -l inf --no-remove-listing) does not fetch those on its own.
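
Putting those together (same placeholder URL):

wget -m -k -p www.example.org/phpbb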