How often do Google's web spiders crawl the web?

Solution 1:

Google's spiders are constantly crawling the web. They run many machines that crawl continuously, revisiting their massive index and adding new pages to it all the time.

Reasons it's fast:

  • They have tons of machines doing the crawling at ridiculous speeds
  • They have tons of bandwidth available
  • They already have a giant index of pages, which saves time hunting for new content: they can re-request previously indexed pages and parse them for new links to crawl (see the sketch after this list).
  • They have been doing this for years and have fine-tuned their crawling algorithm. They continue to work on it to this day to make it even better.
  • Certain sites are indexed more often depending on certain factors, PR (PageRank) being a big one. If your site has a high PR, you'll see it updated quickly. That's why you'll often see Super User questions turn up in search results minutes after they've been asked.
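As a rough illustration of those last points, here is a toy frontier scheduler: it re-fetches already-known pages, extracts any new links, and crawls the highest-priority URLs first. The score() function is only a stand-in for whatever blend of signals (PageRank among them) Google actually uses, and the fetch callable is assumed to be supplied by the caller:

    # Toy crawl frontier: re-parse known pages for new links, crawl
    # highest-scoring URLs first. Uses only the standard library.
    import heapq
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects absolute href targets from anchor tags."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url, self.links = base_url, []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def score(url):
        # Placeholder priority: a real crawler would weigh PageRank,
        # update frequency, and many other signals here.
        return 1.0

    def crawl(seed_urls, fetch, max_pages=100):
        """fetch(url) -> HTML string; returns the set of visited URLs."""
        frontier = [(-score(u), u) for u in seed_urls]  # max-heap via negation
        heapq.heapify(frontier)
        seen = set(seed_urls)
        while frontier and len(seen) < max_pages:
            _, url = heapq.heappop(frontier)
            parser = LinkExtractor(url)
            parser.feed(fetch(url))
            for link in parser.links:
                if link not in seen:  # only brand-new pages enter the frontier
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
        return seen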

Edit:

(image: additional factors that affect crawl frequency)

...among many other factors.

Google has an abundance of space and bandwidth. Don't you worry about them! As of January 2008, Google was processing (on average) 20 PB of data a day. 20 PB (petabytes) is 20,000 terabytes, or 20 million gigabytes. And that's just what flows through its processing pipeline; it isn't all of their data, it's a fraction of it.
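To put that in perspective, a quick back-of-envelope conversion (decimal units, matching the figures above):

    # 20 PB/day expressed as a sustained per-second rate.
    petabytes_per_day = 20
    gigabytes_per_day = petabytes_per_day * 1_000_000   # 20,000,000 GB
    gigabytes_per_second = gigabytes_per_day / (24 * 60 * 60)
    print(f"{gigabytes_per_day:,} GB/day = {gigabytes_per_second:,.0f} GB/s")
    # 20,000,000 GB/day is roughly 231 GB/s, around the clock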

An interesting question came up while running experiments at such a scale: Where do you put 1PB of sorted data? We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks). To make sure we kept our sorted petabyte safe, we asked the Google File System to write three copies of each file to three different disks.
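That "at least one disk breaks every run" observation is easy to sanity-check. The quote gives neither the run length nor a failure rate, so both numbers below are assumptions (a several-hour sort and a ~3% annualized failure rate are plausible for hardware of that era):

    # Expected disk failures in one run: n * AFR * t / hours_per_year.
    disks = 48_000
    afr = 0.03             # assumed: ~3% of disks fail per year
    run_hours = 6          # assumed: one petabyte sort takes ~6 hours
    hours_per_year = 365 * 24
    expected_failures = disks * afr * run_hours / hours_per_year
    print(f"Expected failures per run: {expected_failures:.2f}")
    # ~0.99 -- about one dead disk per run, consistent with the quote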

Simply incredible.

Solution 2:

I suspect Google uses a few extra signals to decide when to re-crawl:

  • account activity in Analytics or Google Webmaster Tools
  • Twitter activity
  • search activity
  • toolbar activity
  • Chrome URL completion
  • perhaps requests to their DNS service
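None of these signals is publicly confirmed, so the sketch below is purely illustrative: the signal names and weights are invented, and the real system is certainly far more involved. It just shows the general shape of blending several signals into one recrawl priority:

    # Illustrative only: blend hypothetical freshness signals into a
    # single recrawl priority. Names and weights are made up; Google
    # does not document its actual inputs.
    WEIGHTS = {
        "analytics_hits": 0.3,
        "twitter_mentions": 0.2,
        "search_clicks": 0.3,
        "toolbar_visits": 0.1,
        "dns_lookups": 0.1,
    }

    def recrawl_priority(signals: dict) -> float:
        """Weighted sum of normalized (0..1) signal values."""
        return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

    # A busy page scores higher and would be re-crawled sooner.
    print(recrawl_priority({"search_clicks": 0.9, "twitter_mentions": 0.7}))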

Then they need to check when a listing page was last updated and, if it has changed, mine it for newly created pages. The sitemap is the preferred listing page (Super User has one), then feeds, then the home page, which tends to list recent pages and is therefore updated whenever another page is.
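Here's a minimal sketch of the sitemap half of that: pull out the entries whose lastmod is newer than the last crawl. It uses only the standard library and skips everything a real crawler would also handle (robots.txt, sitemap index files, malformed dates):

    # Mine a sitemap for pages changed since the last crawl.
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def pages_changed_since(sitemap_xml: str, cutoff_iso: str):
        """Yield (url, lastmod) for entries modified after cutoff_iso.
        W3C datetime strings ("2010-05-01" or "2010-05-01T12:00:00Z")
        sort lexicographically, so string comparison is enough here."""
        root = ET.fromstring(sitemap_xml)
        for entry in root.findall("sm:url", NS):
            loc = entry.findtext("sm:loc", namespaces=NS)
            lastmod = entry.findtext("sm:lastmod", namespaces=NS)
            if loc and lastmod and lastmod > cutoff_iso:
                yield loc, lastmod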

Solution 3:

Google's crawling frequency is determined by many factors, such as PageRank, the links to a page, and crawling constraints such as the number of parameters in a URL.
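The URL-parameter constraint is the easiest of those to illustrate: a crawler might skip URLs with too many query parameters, since those often point at faceted or duplicate pages. The threshold here is arbitrary, purely for illustration:

    # Skip URLs with too many query parameters (threshold is arbitrary).
    from urllib.parse import urlparse, parse_qs

    def too_many_params(url: str, limit: int = 3) -> bool:
        return len(parse_qs(urlparse(url).query)) > limit

    print(too_many_params("https://example.com/p?a=1&b=2&c=3&d=4"))  # True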

Here's an excellent paper on how it is done:

The Anatomy of a Large-Scale Hypertextual Web Search Engine