How can I protect my sitemap index file and sitemap.xml files from leechers?

You could always use a URL for the sitemap that is not disclosed to anyone apart from the search engines you explicitly submit it to.

Have a look at http://en.wikipedia.org/wiki/Sitemaps


You should use a whitelist and only allow good search engines, like Google and Bing, to access these sitemap files.

This is a huge problem that I'm afraid most people don't even consider when submitting sitemap files to Google and Bing. I track every request to my XML sitemap files, and I've denied access to over 6,500 IPs since I started doing this 3 months ago. Only Google, Bing, and a few others ever get to view these files now.

Since you are using a whitelist and not a blacklist, scrapers can buy all the proxies they want and they will never get through. You should also verify an IP with DNS before you whitelist it, to make sure it really belongs to Google or Bing. As for how to do this in PHP, I have no idea, as we are a Microsoft shop and only do ASP.NET development.

I would start by getting the range of IPs that Google and Bing run their bots out of. When a request comes in from one of those IPs, perform a reverse DNS lookup on the IP and make sure "googlebot" or "msnbot" is in the host name that comes back. If it is, perform a forward DNS lookup on that host name and make sure the IP address returned matches the original IP address. If it does, you can safely allow the IP to view your sitemap file. If it doesn't, deny access and 404 the jokers. I got that technique from talking to a Google techie, BTW, so it's pretty solid.
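
For anyone who wants to see the DNS check spelled out, here is a minimal sketch in Python (not PHP or ASP.NET, just because it is short to read). The function name is_verified_crawler and the suffix list are my own assumptions, not something from Google or Bing. It matches on the end of the host name rather than looking for "googlebot" or "msnbot" anywhere in it, since a substring could also appear in a host name someone else controls. Check the suffixes against each engine's current documentation before relying on them.

```python
import socket

# Assumed host name suffixes for Google and Bing crawlers; verify before use.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Return True only if ip passes a reverse lookup plus a forward confirm."""
    try:
        # Step 1: reverse lookup, IP address -> host name.
        hostname = reverse_lookup(ip)[0].rstrip(".").lower()
        # Step 2: the host name must end in a trusted crawler domain.
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        # Step 3: forward lookup, host name -> addresses (IPv4 only here).
        addresses = forward_lookup(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror land here: any failure means deny.
        return False
    # Step 4: the original IP must be among the addresses that came back.
    return ip in addresses
```

You would call is_verified_crawler(request_ip) before serving the sitemap and return a 404 when it is False. The two lookup parameters are only there so the check can be exercised without touching the network. DNS lookups are slow, so in practice you would probably cache the result per IP.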

Note, I own and operate a site that does around 4,000,000 page views a month, so for me this was a huge priority, as I didn't want my data that easily scraped. Also, I use reCAPTCHA after 50 page requests from the same IP in a 12-hour period, and that really works well to weed out bots.
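
The 50-requests-in-12-hours rule can be sketched roughly like this. It is an in-memory, single-process illustration only, and the names needs_captcha, WINDOW_SECONDS and MAX_REQUESTS are made up for the example. It is not how the site above actually does it, and a real site would likely keep the counters in a shared store.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 12 * 60 * 60   # the 12-hour period
MAX_REQUESTS = 50               # page requests allowed before a challenge

_hits = defaultdict(deque)      # ip -> timestamps of its recent requests


def needs_captcha(ip, now=None):
    """Record one request from ip; return True once it exceeds the limit."""
    now = time.time() if now is None else now
    hits = _hits[ip]
    # Drop timestamps that have fallen out of the 12-hour window.
    while hits and now - hits[0] >= WINDOW_SECONDS:
        hits.popleft()
    hits.append(now)
    return len(hits) > MAX_REQUESTS
```

Call needs_captcha(request_ip) once per page request and show the challenge when it returns True. Verified search engine IPs should skip this check, otherwise you end up challenging the crawlers you just whitelisted.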

I took the time to write this post as I hope it will help someone else out and shed some light on what I think is a problem that goes largely unnoticed.