What hardware & software considerations does a major website require to properly manage 1000+ servers? [closed]

Sorry for such a high level question. I understand the basics of server load balancing, but the concept of managing 30,000 servers is a bit foreign to me. Is it really just the same concept of balancing 2 or 3 servers scaled up 10,000 times?

How does this relate to things like memcached, sql/mysql, search engines, etc?

Is it a hierarchical system of 'controller' servers and slave servers that deliver the data? How is redundancy handled?

Thanks for any info or direction to an article on the matter.

EDIT Thanks for the responses, guys. My post was closed, but I've revamped the title and hopefully it will be reopened. I find the problem-solving process behind these very large-scale data solutions fascinating, and I am currently building an API that will require some basic load balancing, hence the question.


Most of the software stack that Google uses on their servers was developed in-house. To lessen the effects of unavoidable hardware failure, software is designed to be fault tolerant.

Source: Google Platform

After reading the article, I am guessing it is the same concept as balancing the load between a few servers, scaled up to 1000+ servers, using a software stack developed in-house on top of Linux, e.g. GFS (Google File System) and BigTable, a structured storage system built on top of GFS.

This link describes how they balance network load.

They use load-balancing switches to distribute the load. All requests for the web site arrive at a machine that then passes the request to one of the available servers. The switch can find out from the servers which one is least loaded, so all of them do an equal amount of work.
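Roughly, the "least loaded" decision the switch makes looks something like this minimal Python sketch (the backend pool and connection counts are made up for illustration; a real balancer gets these numbers from health checks or the servers themselves):

```python
import random

# Hypothetical backend pool; in practice the switch learns these
# connection counts from the servers or from its own bookkeeping.
backends = [
    {"host": "10.0.0.1", "active_connections": 12},
    {"host": "10.0.0.2", "active_connections": 3},
    {"host": "10.0.0.3", "active_connections": 7},
]

def pick_backend(pool):
    """Least-connections selection: forward the request to whichever
    server is currently doing the least work, breaking ties randomly."""
    least = min(b["active_connections"] for b in pool)
    candidates = [b for b in pool if b["active_connections"] == least]
    return random.choice(candidates)

server = pick_backend(backends)
print("forwarding request to", server["host"])
```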

Google's network topology is as follows:

When a client computer attempts to connect to Google, several DNS servers resolve www.google.com into multiple IP addresses using a round-robin policy. This acts as the first level of load balancing and directs the client to different Google clusters. A Google cluster has thousands of servers, and once the client has connected to a cluster, additional load balancing is done to send the query to the least-loaded web server.
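You can see the round-robin DNS part from any client: a single name can resolve to several addresses, and which one you use is itself a crude first level of load balancing. A small Python sketch (the rotation logic is just for illustration; real resolvers and browsers have their own selection rules):

```python
import itertools
import socket

# A round-robin name may resolve to several A records.
infos = socket.getaddrinfo("www.google.com", 443, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
print("candidate addresses:", addresses)

# Naive client-side rotation over the returned addresses, rather than
# always connecting to the first one.
rotation = itertools.cycle(addresses)
for _ in range(3):
    print("next connection would go to", next(rotation))
```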


The big point here is: if the software isn't designed to scale, how can it? For example, one of Facebook's biggest restrictions right now is their reliance on MySQL--they've been able to skirt the issue by throwing more and more machines at it, but their own engineer calls it "a fate worse than death."
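The "throwing more machines at it" approach usually means horizontal sharding: splitting the data across many MySQL instances by some key. A toy sketch of the idea (the shard names and modulo scheme are purely illustrative, not Facebook's actual setup):

```python
# Toy horizontal sharding: spread users across several MySQL instances.
# Shard names are hypothetical.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for_user(user_id: int) -> str:
    """Modulo sharding is simple, but adding machines later forces most
    keys to move -- part of why scaling this way gets painful."""
    return SHARDS[user_id % len(SHARDS)]

print(shard_for_user(42))    # mysql-shard-2
print(shard_for_user(1001))  # mysql-shard-1
```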

Typically, you'll need to be able to load balance requests--and many projects, open source or otherwise, are designed to do exactly that. But this comes with overhead, including writing logs, delayed writes, and "eventually consistent" architectures. In other words, scaling doesn't come cheap.

So things like web servers, which serve static content, can easily be parallelized. Memcached and other caching systems are easily load balanced. But how do you deal with single points of failure? How does your single, large relational database scale? What about file stores? In essence, this is an entire branch of research... not something that can be answered by a single question.
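As an example of the "easily load balanced" caching case above: memcached clients typically pick a cache node by hashing the key, often with a consistent-hash ring so that adding or removing a node only moves a fraction of the keys. A minimal sketch of that idea (node names and replica count are arbitrary):

```python
import bisect
import hashlib

class HashRing:
    """A small consistent-hash ring, the usual trick cache clients use to
    spread keys over nodes without a central coordinator."""

    def __init__(self, nodes, replicas=100):
        self.ring = {}          # position on the ring -> node name
        self.sorted_keys = []
        for node in nodes:
            for i in range(replicas):
                pos = self._hash(f"{node}:{i}")
                self.ring[pos] = node
                bisect.insort(self.sorted_keys, pos)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        pos = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42:profile"))
```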


I think the concepts are the same, and the critical point is how you distribute the load and data among the available resources and how you locate your data.

One way is the geographical distribution of servers. Each user will be directed to the nearest server.

A registry-like service can be used to look up the requested data.

Think of the DNS implementation: it holds a huge distributed database. Root nodes direct users to lower-level nodes, and so on, until you reach the responsible node that can answer your query.
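A toy version of that delegation idea in Python (the data and names are made up; real DNS adds caching, redundancy at every level, and many more record types):

```python
# Each level of the hierarchy only knows which child is responsible
# for the next label; you descend until you reach the answer.
tree = {
    "com": {
        "example": "93.184.216.34",
    },
}

def resolve(name: str):
    """Walk from the root toward the responsible node, DNS-style."""
    node = tree
    for label in reversed(name.split(".")):  # "example.com" -> "com", "example"
        node = node[label]
    return node

print(resolve("example.com"))  # 93.184.216.34
```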