How are large websites served to millions of users? (e.g. Google)

I appreciate this question might be vague/too broad, but I'm looking for the basic principles/a summary.

How does a site like Google, or Facebook, for example, deal with the billions of page views it receives?

I'm aware of round-robin DNS, which, I understand, serves one IP to visitor A, then another IP to visitor B in a round-robin fashion, and so on.

Do these sites operate several (hundred?) servers that have a copy of the "google" website on each server, and are all synchronised?

To try and summarise - how do very large sites with millions of page views actually deal with the traffic? How are they maintained? And where would one go to get experience for setting this up?

I would like to find out more, but without actually having the need for such a setup, I am finding it difficult to get case studies or material to learn from.

Hope this makes some degree of sense. Thanks.


Summary: big enterprise customers, such as airline flight planning, use Oracle, Sun, IBM BladeCenters and custom code; big companies like eBay, Twitter, Facebook and Google use everything custom, whatever they can make work, and keep it secret too, because it's one of the very hard things they've had to solve to make their company possible at all.

--

Small webservers have become very common, and you'll typically see a webserver like Apache, Tomcat or IIS, and maybe with a database behind it (PostgreSQL, SQL Server or MySQL), and maybe with a programming layer in there too (PHP, Python, Ruby, Java, etc).
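
As a rough illustration of that single-box setup (a minimal sketch of my own, not taken from any real deployment; the schema and port are made up), the whole stack can be one process talking to a local database:

    import sqlite3
    from wsgiref.simple_server import make_server

    # One box, one process: the "programming layer" and the database together.
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, body TEXT)")
    db.execute("INSERT INTO pages VALUES ('/', 'Hello from a single server')")
    db.commit()

    def app(environ, start_response):
        # Look the requested path up in the database and return it.
        row = db.execute("SELECT body FROM pages WHERE path = ?",
                         (environ["PATH_INFO"],)).fetchone()
        status, body = ("200 OK", row[0]) if row else ("404 Not Found", "not found")
        start_response(status, [("Content-Type", "text/plain")])
        return [body.encode()]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()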

For bigger but still small setups, you separate those layers onto different servers - two running Apache, both looking at the same shared files, two running the database with half the data in each, maybe another doing caching, or maybe you just make them as powerful as you can afford. This can get you a long way - Plenty of Fish got into the HitWise top 100 websites in 2007, serving 2 million+ views per hour, with one server and image hosting outsourced to Akamai.
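
To make the "two servers behind one address" idea concrete, here is a hedged sketch of the simplest possible load balancer: a front end that hands each request to the next backend in turn. The backend addresses are hypothetical, and real balancers (HAProxy, nginx, hardware appliances) do far more.

    import itertools
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical application servers sitting behind this front end.
    BACKENDS = itertools.cycle(["http://10.0.0.1:8000", "http://10.0.0.2:8000"])

    class RoundRobinProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            backend = next(BACKENDS)  # hand each request to the next server in turn
            with urllib.request.urlopen(backend + self.path) as resp:
                body = resp.read()
                ctype = resp.headers.get("Content-Type", "text/plain")
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.end_headers()
            self.wfile.write(body)  # error handling deliberately omitted

    if __name__ == "__main__":
        HTTPServer(("", 8080), RoundRobinProxy).serve_forever()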

If you are rich, e.g. the government, the airline industry, etc., you can scale up from here by going to massive, specialist hardware: blade centers, tens-of-processor Sun servers, storage devices with tens of disks, Oracle databases, and so on.

For everyone else, the question of how to scale up on the cheap has no standard answer. How they do it is one of the core problems of their company, and one they spend a lot of effort custom-building solutions for.

It will likely include interesting ways of getting many database servers involved (not Google, though: they wrote their own filesystem, and a database replacement on top of it). You might see sharding (split your content A-M onto one server, N-Z onto another), or replication (all servers have the same data, reads come from any of them, writes go to all of them), or something custom.
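
A hedged sketch of the sharding idea, routing each key to a database server by its first letter; the hostnames are made up, and real schemes usually hash the key instead:

    # Hypothetical database hosts; content A-M lives on one, N-Z on the other.
    SHARDS = {"a-m": "db1.example.internal", "n-z": "db2.example.internal"}

    def shard_for(key):
        """Return the database host responsible for this key."""
        first = key[0].lower()
        return SHARDS["a-m"] if "a" <= first <= "m" else SHARDS["n-z"]

    print(shard_for("alice"))  # db1.example.internal
    print(shard_for("zoe"))    # db2.example.internal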

It will probably include a lot of caching servers, e.g. running Memcached. These have lots of RAM and quickly return the results of database queries that have been run recently, or files that have been requested recently. In 2008, Facebook said "We use more than 800 (memcached) servers supplying over 28 terabytes of memory to our users." (link)
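
The pattern those caching servers support is usually "cache-aside": check the cache first, fall back to the database on a miss, then store the result. A hedged sketch, with a plain dict standing in for memcached and a made-up query function:

    import time

    CACHE = {}          # a plain dict stands in for the memcached pool here
    TTL_SECONDS = 60

    def query_database(key):
        # Placeholder for a slow database query.
        return "row for " + key

    def get(key):
        hit = CACHE.get(key)
        if hit is not None and hit[1] > time.time():
            return hit[0]                    # served from RAM, database untouched
        value = query_database(key)          # cache miss: do the expensive work once
        CACHE[key] = (value, time.time() + TTL_SECONDS)
        return value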

You'll probably find some CDN (content delivery network) service such as Akamai: you give them all your pictures, they spread them around the world, you link to them, and the copy nearest to each user is served automatically from their network.
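
From the application's side, using a CDN often amounts to little more than rewriting static asset URLs to point at the CDN's hostname and letting them handle the geography. A hedged sketch with a made-up hostname:

    # Hypothetical CDN hostname; the CDN picks the copy nearest to each user.
    CDN_HOST = "https://cdn.example-provider.net"

    def asset_url(path):
        """Rewrite a local static path so it is served from the CDN instead."""
        return CDN_HOST + path

    print(asset_url("/images/logo.png"))
    # https://cdn.example-provider.net/images/logo.png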

You'll also find a lot of custom code, and people working hard on it but keeping it a secret. Serving eBay auctions means handling a lot of traffic, but the data for a single auction item is mostly static; searching eBay auctions, by contrast, means a lot of data processing. Google searching the web also means a lot of data processing, but in a different way, with different data stored on different servers. Facebook means a lot of information travelling criss-cross between a lot of users, and so does Twitter, but with different characteristics. Google and Facebook design their own server hardware.


They have many different locations and all users are directed to the nearest location. This is done with Anycast.

In each location there are many front-end servers (web servers), and behind them several different database clusters. Often database sharding is used there.

Often there is a layer in between the front-end servers and the backend database servers; this is where the calculation and data processing happens. Google uses MapReduce there.
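
For a feel of what MapReduce does, here is a hedged single-machine sketch of the classic word count: map each record to key/value pairs, shuffle them by key, then reduce each group. The real thing spreads the map and reduce phases across thousands of servers.

    from collections import defaultdict

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Map: emit (word, 1) for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: combine each group into a single result.
    word_counts = {word: sum(counts) for word, counts in groups.items()}
    print(word_counts)  # {'the': 3, 'quick': 2, 'brown': 1, ...}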

This is a very short introduction, but the links should help you find more detailed information.


How does a site like Google, or Facebook, for example, deal with the billions of page views it receives?

Many many servers in many many data centers.

I'm aware of round-robin DNS, which, I understand, serves one IP to visitor A, then another IP to visitor B in a round-robin fashion, and so on.

Ah. No. It returns different IPs (round robin) on every DNS request, but that does not necessarily mean every visitor gets a different one.
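
You can see those multiple A records yourself; a hedged sketch (the domain is only an example, and how many addresses you get back depends on the resolver you hit):

    import socket

    # One name, several A records; different lookups may return a different order.
    infos = socket.getaddrinfo("www.google.com", 443, type=socket.SOCK_STREAM)
    for address in sorted({info[4][0] for info in infos}):
        print(address)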

Do these sites operate several (hundred?) servers that have a copy of the "google" website on each server, and are all synchronised?

No. Make this "tens of thousands" of servers and the answer is yes. Google operates a LOT of data centers, with 100,000+ servers IN EACH.

And they use AS routing to redirect traffic to the closest data center.