How to generate a sitemap on a highly dynamic website?

Solution 1:

On Stack Overflow (and all Stack Exchange sites), a sitemap.xml file is created that contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. Appending to the end of the file isn't very resource intensive, although the file itself is quite large.
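To illustrate why appending is cheap, here is a minimal Python sketch. It is not Stack Overflow's actual code; the function name `append_url` and the assumption that the file ends with exactly `</urlset>` plus a newline are mine. It overwrites the closing tag in place, so the rest of the large file is never read or rewritten.

```python
import os
from xml.sax.saxutils import escape

# Assumption: the sitemap file always ends with exactly this byte sequence.
CLOSING_TAG = b"</urlset>\n"


def append_url(sitemap_path: str, loc: str) -> None:
    """Append one <url> entry by overwriting the closing tag in place."""
    entry = f"  <url><loc>{escape(loc)}</loc></url>\n".encode("utf-8")
    with open(sitemap_path, "r+b") as f:
        # Jump straight to where the closing tag should start.
        f.seek(-len(CLOSING_TAG), os.SEEK_END)
        if f.read() != CLOSING_TAG:
            raise ValueError("sitemap does not end with the expected closing tag")
        # Overwrite the closing tag with the new entry, then close the root again.
        f.seek(-len(CLOSING_TAG), os.SEEK_END)
        f.write(entry + CLOSING_TAG)
```

The cost of each call depends only on the size of the new entry, not on the size of the file. A real deployment would also need locking if several requests can append at once.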

That is the only way search engines like Google can effectively crawl the site.

Jeff Atwood talks about it in a blog post: The Importance of Sitemaps

This is from Google's webmaster help page on sitemaps:

Sitemaps are particularly helpful if:

  • Your site has dynamic content.
  • Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
  • Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
  • Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.

Solution 2:

There's no need to regenerate the Google sitemap XML each time a question is posted. It's far simpler to generate the XML file on demand directly from the database, with a little caching.
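A rough Python sketch of that idea follows. The `questions` table, the `/questions/<id>` URL scheme, the 10-minute cache lifetime, and the function names are all assumptions made for illustration. The sitemap is rebuilt from the database only when the cached copy is older than the lifetime.

```python
import sqlite3
import time
from xml.sax.saxutils import escape

CACHE_TTL_SECONDS = 600  # assumed value; tune to how fresh the sitemap must be
_cache = {"xml": None, "built_at": 0.0}


def build_sitemap(conn: sqlite3.Connection, base_url: str) -> str:
    """Render every row of the (hypothetical) questions table as a <url> entry."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for (question_id,) in conn.execute("SELECT id FROM questions ORDER BY id"):
        loc = f"{base_url}/questions/{question_id}"
        lines.append(f"  <url><loc>{escape(loc)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def get_sitemap(conn: sqlite3.Connection, base_url: str, now=None) -> str:
    """Return the cached sitemap, rebuilding it once the cache has expired."""
    now = time.time() if now is None else now
    if _cache["xml"] is None or now - _cache["built_at"] > CACHE_TTL_SECONDS:
        _cache["xml"] = build_sitemap(conn, base_url)
        _cache["built_at"] = now
    return _cache["xml"]
```

The `now` parameter exists only so the expiry logic can be exercised without waiting; a request handler would call `get_sitemap(conn, base_url)` and return the string with an XML content type.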

To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day/month would allow you to tell Google to retrieve today's sitemap frequently, but only fetch the sitemap from six months ago once in a while.
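One way to express this partitioning is a sitemap index with one child sitemap per month, each carrying a `<lastmod>` date so a crawler can see which months have changed. The Python sketch below assumes a hypothetical `/sitemap-YYYY-MM.xml` URL scheme and takes the posting dates as plain `datetime.date` values; how often a search engine actually re-fetches each file is up to the search engine.

```python
from datetime import date
from xml.sax.saxutils import escape


def build_monthly_index(post_dates, base_url: str) -> str:
    """Build a <sitemapindex> with one entry per month that has content."""
    latest = {}  # (year, month) -> most recent posting date in that month
    for d in post_dates:
        key = (d.year, d.month)
        if key not in latest or d > latest[key]:
            latest[key] = d

    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for year, month in sorted(latest):
        # Hypothetical URL scheme for the per-month child sitemaps.
        loc = f"{base_url}/sitemap-{year:04d}-{month:02d}.xml"
        lastmod = latest[(year, month)].isoformat()
        lines.append(
            f"  <sitemap><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></sitemap>"
        )
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"
```

Only the current month's child sitemap keeps getting a new `<lastmod>`, so older months can be cached for a long time on the server side as well.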

Solution 3:

I'd like to share my solution here just in case it helps someone as well. It took me reading this question and many others to decide what to do.

My site structure.

Static pages

  • Home (Highly dynamic. Cached for 30 mins)
  • Artists, Albums, Songs and Playlists (Paginated List)
  • Legal (Static page with Terms etc)

...etc

Dynamic Pages

  • Artists, Albums, Songs and Playlists detail pages

My approach.

sitemap.xml: This URL generates a <sitemapindex /> with the first item being /sitemap-main.xml. The number of Artists, Albums, Songs, etc. is counted and divided by 1,000 (the number of URLs I want in each sitemap; the limit is 50,000). I round this number up.

For example, 1,900 songs / 1,000 = 1.9, which rounds up to 2, so I add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index. I repeat this for all other items. Basically, I am paginating.
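The pagination arithmetic can be sketched as follows. This is Python rather than the PHP used on the site, and the function name `sitemap_paths` and the `counts` dictionary are illustrative assumptions; the 1,000-per-file size and the path pattern come from the description above.

```python
URLS_PER_SITEMAP = 1000  # chosen page size; the protocol limit is 50,000


def sitemap_paths(counts: dict) -> list:
    """List the child sitemap paths that the index should contain."""
    paths = ["/sitemap-main.xml"]  # static pages always come first
    for kind, count in counts.items():
        # Ceiling division: 1,900 items need 2 files, 1,000 items need 1.
        pages = (count + URLS_PER_SITEMAP - 1) // URLS_PER_SITEMAP
        paths.extend(f"/sitemap-{kind}-{page}.xml" for page in range(pages))
    return paths
```

Calling `sitemap_paths({"songs": 1900})` yields `/sitemap-main.xml`, `/sitemap-songs-0.xml` and `/sitemap-songs-1.xml`, matching the example above.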

The output is returned uncached. I want this to always be fresh.


sitemap-main.xml: This lists all the static pages. You can actually use a static file for this as you will only need to update it once in a while.


sitemap-songs-0.xml, sitemap-albums-0.xml, etc: I use a single route for this in Slim 2 (PHP).

$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) {...

I use a simple switch statement to generate the relevant files. If a page comes back with the full 1,000 items (the limit specified above), it is unlikely to change, so I cache the file for 2 weeks. Otherwise, I only cache it for a few hours.
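The caching rule boils down to a small decision, sketched here in Python rather than the site's PHP. The function name `cache_ttl` is hypothetical, and 3 hours is an assumed stand-in for "a few hours".

```python
from datetime import timedelta

PAGE_SIZE = 1000  # items per sitemap page, as above
FULL_PAGE_TTL = timedelta(weeks=2)
PARTIAL_PAGE_TTL = timedelta(hours=3)  # assumed value for "a few hours"


def cache_ttl(item_count: int) -> timedelta:
    """Pick a cache lifetime: long for full pages, short for the last, partial one."""
    if item_count >= PAGE_SIZE:
        return FULL_PAGE_TTL
    return PARTIAL_PAGE_TTL
```

Only the last page of each type is still filling up, so it is the only one that needs a short cache lifetime.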

I guess this can help anyone else implement their own system.