How does Wikipedia generate its Sitemap?

Solution 1:

It's dynamically generated by a PHP script. For big sites it's probably better to check for changes and regenerate only when something has actually changed, or to regenerate on a fixed schedule (every X minutes/hours/days); it depends on the infrastructure.
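
A minimal sketch of the "check for changes" approach, assuming a hypothetical pages table with a last_modified column and the regenerateSitemap() helper sketched after the next paragraph; a cron job could run it on whatever schedule fits:

    <?php
    // Regenerate-on-change sketch (hypothetical schema: a "pages" table with
    // a "last_modified" column). Run from cron every few minutes/hours/days.
    $pdo = new PDO('mysql:host=localhost;dbname=wiki', 'user', 'pass');

    $latest  = strtotime($pdo->query('SELECT MAX(last_modified) FROM pages')->fetchColumn());
    $sitemap = __DIR__ . '/sitemap.xml';

    // Rebuild only if a page changed after the sitemap file was last written.
    if (!file_exists($sitemap) || filemtime($sitemap) < $latest) {
        regenerateSitemap($pdo, $sitemap); // hypothetical helper, see the next sketch
    }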

The information needed is all in the database, so it's not such a hard task.
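
To make that concrete, here is a minimal PHP sketch of such a script; the pages table with url and last_modified columns is a hypothetical stand-in, not MediaWiki's actual schema:

    <?php
    // Build a sitemap straight from the database (hypothetical "pages" table).
    function regenerateSitemap(PDO $pdo, string $path): void
    {
        $xml = new XMLWriter();
        $xml->openURI($path);
        $xml->startDocument('1.0', 'UTF-8');
        $xml->setIndent(true);
        $xml->startElement('urlset');
        $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

        foreach ($pdo->query('SELECT url, last_modified FROM pages') as $row) {
            $xml->startElement('url');
            $xml->writeElement('loc', $row['url']);
            $xml->writeElement('lastmod', date('c', strtotime($row['last_modified'])));
            $xml->endElement(); // </url>
        }

        $xml->endElement(); // </urlset>
        $xml->endDocument();
        $xml->flush();
    }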

And here is the proof: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/generateSitemap.php?view=log / http://www.mediawiki.org/wiki/Manual:GenerateSitemap.php

Edit: these could also be interesting for this topic:

  • Cache strategy
  • Wikimedia servers

Solution 2:

I was faced with the task of creating a sitemap for our website a while back. Although it's not the size of Wikipedia, it's still around a hundred thousand pages, and about 5% of them are changed, added, or removed daily.

Since putting all the page references in a single file would make it too large, I had to divide them into sections. The sitemap index points to an .aspx page with a query string for one of 17 different sections. Depending on the query string, the page returns an XML document referencing several thousand pages, based on which objects exist in the database.

So the sitemap is not created periodically; instead, it's created on the fly when someone requests it. As we already have a system for caching database searches, this is of course also used to fetch the data for the sitemap.
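
The answer above uses an .aspx handler; purely as an illustration, here is a PHP sketch of the same on-the-fly, sectioned idea. The section count of 17 comes from the answer, and getCachedUrlsForSection() is a hypothetical helper standing in for the existing database-search cache:

    <?php
    // Serve one sitemap section on demand, e.g. /sitemap.php?section=3.
    // getCachedUrlsForSection() is a hypothetical wrapper around the site's
    // existing cache of database searches.
    $section = (int)($_GET['section'] ?? 0);
    if ($section < 1 || $section > 17) {
        http_response_code(404);
        exit;
    }

    header('Content-Type: application/xml; charset=UTF-8');

    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    // Each section references a few thousand pages.
    foreach (getCachedUrlsForSection($section) as $entry) {
        $xml->startElement('url');
        $xml->writeElement('loc', $entry['url']);
        $xml->writeElement('lastmod', $entry['lastmod']);
        $xml->endElement(); // </url>
    }

    $xml->endElement(); // </urlset>
    $xml->endDocument();
    echo $xml->outputMemory();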

Solution 3:

Although the sitemap generation code is in MediaWiki core master and would certainly be the option of choice for producing a sitemap, I don't see any evidence that Wikipedia actually has it turned on. The robots.txt file does not point to any sitemaps.
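
This is easy to check for yourself; a quick sketch that fetches robots.txt and looks for Sitemap directives (the URL is just one example wiki):

    <?php
    // Does robots.txt advertise any sitemaps?
    $robots = file_get_contents('https://en.wikipedia.org/robots.txt');
    $sitemapLines = preg_grep('/^\s*Sitemap:/i', explode("\n", (string) $robots));
    var_dump($sitemapLines); // an empty array means no sitemap is advertised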

Further, any maintenance script run on Wikimedia projects is controlled by Puppet, and there is no instance of generateSitemap.php in the Puppet repository. Finally, there is no sitemap in the dumps for any Wikimedia wiki either, though there are "abstracts for Yahoo".

In any case, Wikipedia runs Squid caches in front of its app servers. It can control how often its sitemap is updated by adjusting the expiry time for the page.
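
As a rough sketch of that idea (the 24-hour lifetime is just an example value), the script generating the sitemap could emit explicit freshness headers and let the cache in front serve the stored copy until they expire:

    <?php
    // Let the cache layer (Squid, in Wikipedia's case) decide how often the
    // sitemap is regenerated by giving the response an explicit lifetime.
    $maxAge = 86400; // 24 hours, an arbitrary example

    header('Content-Type: application/xml; charset=UTF-8');
    header("Cache-Control: public, max-age=$maxAge, s-maxage=$maxAge");
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');

    // ...then stream the sitemap XML exactly as in the earlier sketches.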

Moreover, whatever Wikipedia does for indexing is not a good model for your wiki, because Google has special contacts, deals, and handling for Wikipedia; see a recent example.