What software can I use to scrape (download) a MediaWiki wiki?
I want to scrape an entire wiki that uses the MediaWiki software. The number of pages is pretty small, but they have plenty of revisions, and I'd like to scrape the revisions as well.
The wiki does not offer database dumps, unlike Wikipedia. Are there any existing tools or scripts designed to scrape MediaWiki sites?
Solution 1:
If the maintainer of the wiki hasn't turned it off, you can export pages with their history through Special:Export. This will give you an XML dump similar to Wikipedia's database dumps, which you can then import into another wiki.
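For example, a full-history export can typically be requested over HTTP. Below is a minimal Python 3 sketch using the third-party requests library; the wiki URL and page titles are placeholders, and the exact Special:Export parameters (such as history) can vary between MediaWiki versions.

import requests

# Hypothetical wiki; replace with the real index.php URL of the target wiki.
WIKI = "https://wiki.example.org/index.php"

params = {
    "title": "Special:Export",
    "pages": "Main_Page\nAnother_Page",  # newline-separated page titles
    "history": "1",                      # request the full revision history
    "action": "submit",
}

resp = requests.get(WIKI, params=params, timeout=60)
resp.raise_for_status()

# The response body is MediaWiki export XML, suitable for Special:Import
# or importDump.php on another wiki.
with open("export.xml", "wb") as f:
    f.write(resp.content)

If the wiki limits the size of a single export, you may need to request pages in smaller batches.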
Another way to obtain page history from MediaWiki in XML format is to use the prop=revisions API query. However, the API results format is somewhat different from that produced by Special:Export, so you'll probably have to process the output a bit before you can feed it to standard import scripts.
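As a rough illustration, here is a Python 3 sketch (again using requests, against a hypothetical api.php endpoint) that walks through a page's history with prop=revisions and the API's continuation mechanism; parameter names such as rvslots depend on the MediaWiki version.

import requests

# Hypothetical API endpoint; replace with the target wiki's api.php URL.
API = "https://wiki.example.org/api.php"

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Main_Page",
    "rvprop": "ids|timestamp|user|comment|content",
    "rvlimit": "max",   # as many revisions per request as the wiki allows
    "rvslots": "main",  # needed on newer MediaWiki versions (older ones just warn)
    "format": "json",
}

revisions = []
while True:
    data = requests.get(API, params=params, timeout=60).json()
    for page in data["query"]["pages"].values():
        revisions.extend(page.get("revisions", []))
    if "continue" not in data:  # no continuation token left: history is complete
        break
    params.update(data["continue"])

print(f"Fetched {len(revisions)} revisions")

You would still need to convert this JSON into the export XML schema before feeding it to importDump.php, which is the extra processing step mentioned above.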
Solution 2:
Check out the tools available from WikiTeam: http://archiveteam.org/index.php?title=WikiTeam
I personally use WikiTeam's dumpgenerator.py, which is available here: https://github.com/WikiTeam/wikiteam
It depends on Python 2. You can get the software with Git or download the ZIP from GitHub:
git clone https://github.com/WikiTeam/wikiteam.git
The basic usage is:
python dumpgenerator.py http://wiki.domain.org --xml --images