Does Google's web crawler download binary files?

My Google-fu is failing me right now.

I'm trying to figure out whether Google's web crawler downloads non-image binary files when it spiders sites. I know it downloads (and indexes) images and PDFs, but what about .zip, .dmg, etc.?

My client offers a lot of software packages for download on their site, and they're trying to figure out whether search engines account for much of the bandwidth those files consume.


Solution 1:

The answer to your first question seems to be "maybe":

What file types can Google index?

Google can index the content of most types of pages and files. See the most common file types.

But the file types listed at that link are all text-based.

Even if you search for binary files like Windows installers (.msi), you may get a link to a page containing the file or a direct link to the file itself, but Google almost certainly decides how to index it based on the text surrounding the link rather than by downloading and deciphering the binary's contents.

As to your main question, Google's recommended way of checking whether a given hit really came from Googlebot is a reverse-DNS lookup:

$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
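
To guard against spoofed reverse DNS, Google also recommends a forward lookup on the resulting hostname to confirm it resolves back to the original IP, something like:

$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1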

Keep in mind that Google's mission "is to organize the world’s information and make it universally accessible and useful." This means they are constantly innovating, attempting to index non-text data in ways that make it searchable. To expand on ceejayoz's point that just because they didn't do it yesterday doesn't mean they won't do it tomorrow: Google will do everything they can to be able to do it tomorrow!

Solution 2:

Instead of guessing, why not check the access logs to see what the User-Agent or requesting host is? That way you can even tell how much bandwidth Google (or other crawlers) is consuming, by summing the bytes transferred per request.
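
As a rough sketch, assuming an Apache-style combined log format (where the response size is the tenth whitespace-separated field) and a log at /var/log/apache2/access.log; adjust both for your setup:

# Total bytes served to requests whose User-Agent mentions Googlebot
grep -i 'Googlebot' /var/log/apache2/access.log \
  | awk '{ bytes += $10 } END { printf "%.1f MB\n", bytes / 1048576 }'

# The same, broken down by requested file and sorted by size
grep -i 'Googlebot' /var/log/apache2/access.log \
  | awk '{ bytes[$7] += $10 } END { for (f in bytes) printf "%.1f MB\t%s\n", bytes[f] / 1048576, f }' \
  | sort -rn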

Solution 3:

I recently noticed an unusual spike in my web server's traffic. The web stats showed that the handful of large binary files on my site had been downloaded in rapid succession by a group of seemingly related IP addresses. I used urlquery.net to look up who owns those IPs, and they all turned out to be Google's.

I came here looking for answers, but in reading what others have said, I realized that Google may be scanning binaries for malware, or at least submitting them to malware detection services for scanning. We know that Google detects and flags malware on web sites, so it's reasonable to assume that doing this involves downloading the files in question.

Google's 'If your site is infected' page says this: 'Use the Fetch as Google tool in Webmaster Tools to detect malware'.

Note also that the files in question do not appear in Google's search results, presumably because my robots.txt disallows crawling them. Assuming I'm right, when Google finds a binary file linked from a public web page, it will scan the file for malware regardless of robots.txt, but will only index the file if robots.txt allows it. I think this is exactly what they should be doing, as long as the scanning is infrequent.
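
For reference, disallowing a whole download area looks something like this in robots.txt (the /downloads/ path is just an illustration; substitute wherever your binaries live):

User-agent: *
Disallow: /downloads/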

Update: Google seems to be doing this every ten days or so, which is going to eat into my bandwidth allowance.