Suggestions on large-scale web application architecture?

It's a big question :) We are running a LAMP website that is not big: 5 web servers behind LVS load balancing, 3 MySQL servers with replication and read/write splitting, Memcached for caching, and some full-text search tools. So far it works well because we do not have much traffic at the moment.

But when the user base grows rapidly, we will have to scale the architecture to keep up. We may need to introduce a distributed file system and database (and parallel computing?), along with tools for clustering and maintenance (like Gearman and Pshell).

There are articles on the net that I can read through, but I would really value some practical experience so I can prepare for this feasibly and efficiently.


There are many methods for scaling web applications and the supporting infrastructure. Cal Henderson wrote a good book on the subject called "Building Scalable Web Sites". It was based on his experiences with Flickr. Unless you grow slowly, you will run into the same kinds of growth issues many others have seen. Scaling is, like many other subjects, a journey, not a destination.

The first steps are to make everything repeatable, measurable, and manageable. By repeatable I mean using tools like FAI or kickstart to install the OS, and something like puppet or cfengine to configure the machines once a base OS is installed. By measurable I mean using something like cacti, cricket, or ganglia to monitor how your cluster is performing now. Measure not just things like load average but also how long it takes to render a page or service a request. Neither of these seems critical when you're starting out, but they will warn you before your system falls over from load, and they make it simple to add ten or a hundred machines at once. Base your growth plans on data, not guesses.
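For the per-request measurement, here is a rough sketch in Python (the same idea is easy to do in PHP or whatever your app layer is): a WSGI-style wrapper that times every request and logs the numbers so cacti/ganglia/cricket can graph them. The names request_timer and request_times.log are made up for illustration.

    import logging
    import time

    logging.basicConfig(filename="request_times.log", level=logging.INFO)

    def request_timer(app):
        """Wrap a WSGI app and log wall-clock time per request."""
        def timed_app(environ, start_response):
            start = time.time()
            try:
                # Note: this times producing the response, not streaming it out.
                return app(environ, start_response)
            finally:
                elapsed_ms = (time.time() - start) * 1000
                logging.info("%s %s %.1fms",
                             environ.get("REQUEST_METHOD"),
                             environ.get("PATH_INFO"),
                             elapsed_ms)
        return timed_app

    # usage: application = request_timer(application)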

Manageable means putting tools in place that let you automatically generate and test as much of your configuration as you can. Start with what you have and grow it. If you are storing machine info in a database, great. If not, you probably have a spreadsheet you can export. Put your configuration in some sort of source control if you haven't already. Automatically creating configurations from your database lets you grow with less stress, and testing them before they go live on your servers can save you from having a service fail to start because of typos or other errors.
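As a hypothetical sketch of that generate-and-test loop (hosts.csv, vhost.template, and output/ are placeholder names, not anything from your setup): export the host list, render one config per machine, and let a missing field fail here instead of on a production box.

    import csv
    import os
    from string import Template

    # e.g. a vhost template containing "ServerName $hostname" and friends
    template = Template(open("vhost.template").read())

    os.makedirs("output", exist_ok=True)
    with open("hosts.csv") as f:               # exported from your machine database
        for row in csv.DictReader(f):          # columns: hostname, role, ip, ...
            # substitute() raises KeyError if the template wants a column
            # the database doesn't have -- a cheap test before anything goes live.
            rendered = template.substitute(row)
            with open(os.path.join("output", row["hostname"] + ".conf"), "w") as out:
                out.write(rendered)

Before pushing the rendered files, the server's own syntax check (apachectl configtest, nginx -t, and the like) makes a good second gate.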

Horizontal methods assume that you can repeat things appropriately. Think about your application. Which areas make sense to split up? Which areas can be handled by many machines in parallel? Does latency affect your application? How likely are you to run into connection limits or other bottlenecks? Are you asking your web servers to also handle mail delivery, database work, or other chores?

I've worked in environments with hundreds of web servers. Things should get split differently for different types of load. If you have large collections of data files that rarely change, partitioning them away from the actively changing "stuff" can give you more headroom for serving both static and dynamic data. Different tools work better for different loads: Apache and Lighttpd work well for some things, Nginx works better for others.
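One minimal way to make that split at the application layer, sketched in Python with made-up hostnames: rewrite asset URLs so the rarely-changing files are fetched from a dedicated static tier and never touch the dynamic app servers.

    import zlib

    STATIC_HOSTS = ["static1.example.com", "static2.example.com"]   # hypothetical
    STATIC_PREFIXES = ("/images/", "/css/", "/js/")

    def asset_url(path):
        """Point static asset paths at the static tier.

        Hashing on the path keeps each asset on a stable host, which is
        friendlier to browser and proxy caches.
        """
        if path.startswith(STATIC_PREFIXES):
            host = STATIC_HOSTS[zlib.crc32(path.encode()) % len(STATIC_HOSTS)]
            return "http://%s%s" % (host, path)
        return path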

Look at proxies and caches, both between your users and the application and between parts of the application. I read that you are already using memcache; that helps. Putting a reverse proxy like perlbal or pound between your load balancer and the web servers may make sense depending on your application traffic.
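Beyond caching individual database rows, caching whole rendered fragments between parts of the application is a pattern that pays off as traffic grows. A sketch with the python-memcached client; render_sidebar() is a made-up stand-in for whatever expensive templating or query work you want to skip on a hit.

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def cached_fragment(key, ttl, render):
        """Return a rendered HTML fragment, rebuilding it only on a cache miss."""
        html = mc.get(key)
        if html is None:
            html = render()             # the expensive part
            mc.set(key, html, time=ttl)
        return html

    # usage: sidebar = cached_fragment("frag:sidebar:%d" % user_id, 60,
    #                                  lambda: render_sidebar(user_id))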

At some point you may discover that MySQL master <-> (N * slave) replication isn't keeping up and that you need to partition the databases. Partitioning your databases may involve setting up another layer of data management. Many people use another database with memcache for this management. At one place I worked we used master <-> master replicated pairs for most data and another pair with 10 read slaves for pointers to the data.
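A rough sketch of that "pointers to the data" layer, assuming a small directory table with memcache in front of it (the table, host names, and shard map below are hypothetical): look up which shard holds a user's data and cache the answer so the directory databases are rarely touched.

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])
    SHARDS = {1: "dbpair1.internal", 2: "dbpair2.internal"}   # master<->master pairs

    def shard_for_user(user_id, directory_db):
        """Return the host of the shard that holds this user's data."""
        key = "shard:%d" % user_id
        shard_id = mc.get(key)
        if shard_id is None:
            cur = directory_db.cursor()
            cur.execute("SELECT shard_id FROM user_shard WHERE user_id = %s",
                        (user_id,))
            shard_id = cur.fetchone()[0]
            mc.set(key, shard_id, time=3600)
        return SHARDS[shard_id]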

This is just a very bare-bones description of some of the issues I've run into working at sites with hundreds of machines. There is no end to the things that crop up as you grow from a few machines to a few hundred, and I'm sure the same holds true for growth into the thousands.


There's a lot of good literature on this topic that has come out recently. Start at High Scalability; much of the best material is linked from there. You can take a look at Digg's Tech Blog for insights into how we do things, and you can also reach out to the folks on the SAGE lists - they are a great resource.