Server architecture for high traffic system [duplicate]

This is a canonical question about capacity planning for web sites.

Related:

Can you help me with my capacity planning?

How do you do load testing and capacity planning for databases?

What are some recommended tools and methods of capacity planning for web sites and web-applications?

Please feel free to describe different tools and techniques for different web-servers, frameworks, etc., as well as best-practices that apply to web servers in general.

The short answer is: Nobody can answer this question except you.

The long answer is that benchmarking your specific workload is something that you need to undertake yourself, because it's a bit like asking "How long is a piece of string?".

A simple one-page static website could be hosted on a Pentium Pro 150 and still serve thousands of impressions every day.

The basic approach you need to take to answer this question is to try it and see what happens. There are plenty of tools that you can use to artificially put your system under pressure to see where it buckles.

A brief overview of this is:

Put your scenario in place
Add monitoring
Add traffic
Evaluate results
Remediate based on results
Rinse, repeat until reasonably happy

Put your scenario in place

Basically, in order to test some load, you need something to test against. Set up an environment to test against. This should be a fairly close guess to your production hardware if possible, otherwise you will be left extrapolating your data.

Set up your servers, accounts, websites, bandwidth, etc. Even if you do this on VMs that's OK just as long as you're prepared to scale your results.

So, I'm going to set up a mid-powered virtual machine (two cores, 512 MB RAM, 4 GB HDD) and install my favourite load balancer, haproxy inside Red Hat Linux on the VM.

I'm also going to have two web servers behind the load balancer that I'm going to use to stress test the load balancer. These two web servers are set up identically to my live systems.

Add Monitoring

You'll need some metrics to monitor, so I'm going to measure how many requests get through to my web servers, and how many requests I can squeeze through per second before users start getting a response time of over two seconds.

I'm also going to monitor RAM, CPU and disk usage on the haproxy instance to make sure that the load balancer can handle the connections.

How to do this depends a lot on your platforms and is outside of the scope of this answer. You might need to review web server log files, start performance counters, or rely on the reporting ability of your stress test tool.

A few things you always want to monitor:

CPU usage
RAM usage
Disk usage
Disk latency
Network utilisation

You might also choose to look at SQL deadlocks, seek times, etc depending on what you're specifically testing.

Add traffic

This is where things get fun. Now you need to simulate a test load. There are plenty of tools that can do this, with configurable options:

JMeter (Web, LDAP)
Apache Benchmark (Web)
Grinder (Web)
httperf (Web)
WCAT (Web)
Visual Studio Load Test (Web)
SQLIO (SQL Server)

Choose a number, any number. Let's say you're going to see how the system responds with 10,000 hits a minute. It doesn't matter what number you choose because you're going to repeat this step many times, adjusting that number up or down to see how the system responds.

Ideally, you should distribute these 10,000 requests over multiple load testing clients/nodes so that a single client does not become a bottleneck of requests. For example, JMeter's Remote Testing provides a central interface from which to launch several clients from a controlling Jmeter machine.

Press the magic Go button and watch your web servers melt down and crash.

Evaluate results

So, now you need to go back to your metrics you collected in step 2. You see that with 10,000 concurrent connections, your haproxy box is barely breaking a sweat, but the response time with two web servers is a touch over five seconds. That's not cool - remember, your response time is aiming for two seconds. So, we need to make some changes.

Remediate

Now, you need to speed up your website by more than twice. So you know that you need to either scale up, or scale out.

To scale up, get bigger web servers, more RAM, faster disks.

To scale out, get more servers.

Use your metrics from step 2, and testing, to make this decision. For example, if you saw that the disk latency was massive during the testing, you know you need to scale up and get faster hard drives.

If you saw that the processor was sitting at 100% during the test, perhaps you need to scale out to add additional web servers to reduce the pressure on the existing servers.

There's no generic right or wrong answer, there's only what's right for you. Try scaling up, and if that doesn't work, scale out instead. Or not, it's up to you and some thinking outside the box.

Let's say we're going to scale out. So I decide to clone my two web servers (they're VMs) and now I have four web servers.

Rinse, repeat

Start again from Step 3. If you find that things aren't going as you expected (for example, we doubled the web servers, but the reponse times are still more than two seconds), then look into other bottlenecks. For example, you doubled the web servers, but still have a crappy database server. Or, you cloned more VMs, but because they're on the same physical host, you only achieved higher contention for the servers resources.

You can then use this procedure to test other parts of the system. Instead of hitting the load balancer, try hitting the web server directly, or the SQL server using an SQL benchmarking tool.

Capacity planning starts with measurement, in this case response time versus load. Once you know the degree to which the programs slows down with load, which is NOT a linear function, you can select a response time target, and then discover what resources it will take to meet that target for a given amount of load.

Performance measurement is always done with time units, as

they are what users care about
they can be scaled up and down

Things like %CPU and IOPS are system-specific, so you only use them when you have planned the system and measured it in pre-production, to act as a "surrogate" for the thing you care about, time.

Capacity planning is a troublesome beast. It's as much science as art (if definitely a dark one).

Your best case is that you make well-informed decisions and fortune/luck favors you by having reality meet your assumptions. If your capacity need assumptions match reality, you look like a mystical yogi. Unfortunately, if your assumptions exceed reality, you will appear to have overshot and overspent. More unfortunately, if your assumptions are below the eventual reality (or are otherwise incorrect), you will lack the capacity that you need, and will have to scramble to mitigate the failures of your groaning infrastructure, which makes you look like you lack competency.

No pressure...

Unfortunately, the dark art of capacity planning is more than can be reasonably distilled into a single Server Fault answer; really, it's a topic worthy of books.

Fortunately, there is such a book: "The Art of Capacity Planning"