How to improve Windows Server 2008 R2 to handle many connections?

I have been trying to solve this problem for a few days now. I run a website with an average of 350,000 page views per day. Previously, ad management (tracking the clicks and impressions each ad has served) and content were both served from a single server with the following specs:

Server 1
OS: Windows 2008 R2 64-Bit
CPU: Intel® Core™ i5 - 4 cores
RAM: 8 GB
Storage: 2 x 1 TB hard drives
Bandwidth: 10 TB per month

To improve our website speed, I decided to move the ad management script to a separate dedicated server, because we have between 15 and 30 advertisers on each page.

Server 2
OS: Windows 2008 R2 64-Bit
CPU: Intel® Core™ i5 - 4 cores
RAM: 4 GB
Storage: 2 x 300 GB hard drives
Bandwidth: 10 TB per month

The Problem
The problem is that Server 1 could handle both the content and the ad system. Now that I have moved the ad system to Server 2, Server 2 can barely serve the ad system alone.

Test

  • First, I moved 75% of the ads to Server 2 and then pinged the server: ping -t xxxxx. [I ran the ping for 10 minutes and it followed a pattern similar to the one below]
Reply from xxxxx bytes=32 time=290ms TTL=116
Reply from xxxxx bytes=32 time=289ms TTL=116
Reply from xxxxx bytes=32 time=320ms TTL=116
Reply from xxxxx bytes=32 time=286ms TTL=116
Reply from xxxxx bytes=32 time=286ms TTL=116
Reply from xxxxx bytes=32 time=348ms TTL=116
Reply from xxxxx bytes=32 time=284ms TTL=116
  • Then, I moved 100% of the ads to Server 2 and pinged the server again. [I ran the ping for 10 minutes and it followed a pattern similar to the one below]
Reply from xxxxx bytes=32 time=290ms TTL=116
Request timed out
Reply from xxxxx bytes=32 time=320ms TTL=116
Reply from xxxxx bytes=32 time=286ms TTL=116
Request timed out
Request timed out
Reply from xxxxx bytes=32 time=284ms TTL=116

Attempts

  1. Increase MaxUserPort and TcpNumConnections
  2. Restart the server
  3. Increase IIS Max Instances and Instance MaxRequests

Server Resource

  • Only 10%-15% of the network connection is used
  • Only 10%-15% of the CPU is used
  • Only 25% of the memory is used

Solution 1:

Well, let's start. This will be a longer answer.

It looks like you totally misjudged the facts here. Windows - even the outdated 2008 R2, which you should upgrade ASAP - is completely capable of handling a volume that my mobile phone would have no problem handling.
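To put the load in perspective, here is a rough back-of-the-envelope calculation from the numbers in the question (350,000 page views per day, 15 to 30 ads per page). It assumes one HTTP request per ad shown and a purely hypothetical peak factor of 5x the daily average; neither is stated in the question.

```python
# Rough request-rate estimate from the figures in the question.
PAGE_VIEWS_PER_DAY = 350_000
ADS_PER_PAGE = (15, 30)          # low and high end from the question
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_FACTOR = 5                  # assumption: peak traffic is 5x the average


def requests_per_second(page_views, ads_per_page, seconds=SECONDS_PER_DAY):
    """Average ad requests per second, assuming one request per ad shown."""
    return page_views * ads_per_page / seconds


for ads in ADS_PER_PAGE:
    avg = requests_per_second(PAGE_VIEWS_PER_DAY, ads)
    print(f"{ads} ads/page: ~{avg:.0f} req/s average, "
          f"~{avg * PEAK_FACTOR:.0f} req/s at the assumed peak")
```

That is roughly 60 to 120 requests per second on average, which is not a volume that should bring a 4-core server to its knees.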

So, that leaves 3 possible problem areas:

  • Installation. Your drivers may be crappy. Given that you run an outdated operating system - how good are your drivers? Update them - bad drivers CAN cause all kinds of issues.

  • Network. This seriously looks like "My car is too slow, please help me make it faster" when the problem is that you spend most of your time in a traffic jam and complain about the traffic not moving. That is not a car tuning problem. 10 TB of monthly traffic says nothing about network congestion. Watch the traffic statistics on your NIC and react accordingly - if they are not topping out at the speed they should reach, your provider has oversold. Simple as that.

  • Code. It could be that you need more RAM (the computer is busy swapping to disk instead of processing), or crappy code is using your CPU to a degree that keeps the kernel-level TCP stack from reacting properly (yes, ICMP replies are handled at that low a level). This would be brutal - but it is another avenue to check. It could also be that you overload the disks by hitting them too often instead of caching in RAM, but I somehow fail to see that leading to lost pings. Any issue here is not something an admin can fix, though - you have to throw hardware at it, or take a stick and hit the programmer with it until he fixes it (if it is a "stupid"-level mistake that eats the performance - if it is not, then it is a lot harder to make serious gains and you may just need beefier hardware).
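For the network point above, the NIC-side utilization can be worked out from two byte-counter readings taken a known interval apart. This is only a minimal sketch: the function name and the sample values are made up for illustration, and where you read the counters from (for example the byte totals that `netstat -e` prints on Windows) is up to you.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_mbit):
    """Percent of link capacity used between two byte-counter samples."""
    bits = (bytes_after - bytes_before) * 8
    return 100.0 * bits / (interval_s * link_mbit * 1_000_000)


# Hypothetical samples: 75 MB transferred in 10 s on a 1 Gbit/s NIC.
print(f"{link_utilization(0, 75_000_000, 10, 1000):.1f}% of the link used")
```

A low figure here combined with lost packets tells you the bottleneck is not your own NIC.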

It definitely requires no tuning of Windows - a well-configured Windows can deliver a LOT more than that. My file servers regularly - over longer periods of time - deliver 4-6 gigabit from a relatively stock setup.

Now, all the numbers you give say nothing. Seriously.

  • 10%-15% CPU usage COULD still mean swapping.
  • 25% memory usage is likely a good indicator that no swapping happens, but it could still mean the CPU is waiting for IO.
  • 10%-15% network usage means absolutely nothing, because it is only YOUR side of the network. What about upstream? What if the provider is putting 20 servers with 1 gigabit each on a 1 gigabit uplink from the rack and that uplink is overflowing like hell?

The last point is quite likely - dropped packets are a good indicator of that. And this will not be visible to you.
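To put a number on the dropped packets instead of eyeballing them, the ping output can be summarized. A minimal sketch, assuming the English Windows ping output format shown in the question; the sample is the 7-line excerpt from the question, not the full 10-minute run, so the resulting loss percentage is only illustrative.

```python
import re

# Excerpt from the question (address masked as xxxxx).
SAMPLE = """\
Reply from xxxxx bytes=32 time=290ms TTL=116
Request timed out
Reply from xxxxx bytes=32 time=320ms TTL=116
Reply from xxxxx bytes=32 time=286ms TTL=116
Request timed out
Request timed out
Reply from xxxxx bytes=32 time=284ms TTL=116
"""

# Matches "time=290ms" as well as "time<1ms".
REPLY_RE = re.compile(r"time[=<](\d+)ms")


def summarize_ping(text):
    """Return (sent, lost, loss_percent, average_ms) for ping output text."""
    times = []
    lost = 0
    for line in text.splitlines():
        match = REPLY_RE.search(line)
        if match:
            times.append(int(match.group(1)))
        elif "timed out" in line.lower():
            lost += 1
    sent = len(times) + lost
    loss_pct = 100.0 * lost / sent if sent else 0.0
    avg_ms = sum(times) / len(times) if times else None
    return sent, lost, loss_pct, avg_ms


sent, lost, loss_pct, avg_ms = summarize_ping(SAMPLE)
print(f"sent={sent} lost={lost} loss={loss_pct:.1f}% avg={avg_ms:.0f}ms")
```

Running the same summary against pings taken from several different networks helps show whether the loss follows the server or the path to it.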

My advice: turn off everything on the machine for a moment and run a speed test from an external host with a large static file. I would bet you run into congestion higher up.
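Such a speed test can be as simple as timing one download. A minimal sketch using only the Python standard library; the function name and the URL in the note below are placeholders, not something from the question.

```python
import time
import urllib.request


def measure_download(url, chunk_size=64 * 1024, timeout=30):
    """Download url and return (bytes_read, seconds, megabits_per_second)."""
    start = time.perf_counter()
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    # Guard against a zero interval on very small or local files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, elapsed, total * 8 / elapsed / 1_000_000
```

Run it from a machine outside the provider's network, for example `measure_download("http://your-server.example/big-static-file.bin")` with a file of a few hundred MB, and compare the result with the port speed you are paying for.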

Anything you did so far - MaxUserPort, TcpNumConnections, restarting the server, playing around with IIS settings - is totally off the mark and does nothing in the best case. Banging a hammer on a slow car never fixes anything - especially if the car is slow because it is stuck in a traffic jam. I would undo all the changes and start analyzing the whole problem, not only your server. I would bet on network congestion at the moment.