Windows 2008 Server SP2 64bit - TCP Connections never releasing after TIME_WAIT

We have an issue with Windows 2008 Datacenter edition SP2 64-bit. We have a process that polls very frequently and establishes new TCP connections. The system gets into a state where we end up with over 16k connections in the TIME_WAIT state. The default OS timeout is 120 seconds, after which these connections should go away, but that never happens. The connections persist and are never cleaned up, even long after the originating process has terminated (we are still at 16k connections two days after the process was killed). The OS is supposed to time them out, but it doesn't.
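For anyone trying to reproduce the count, here is a minimal sketch that tallies connections by state from `netstat -an` style text. It assumes the usual column layout (protocol, local address, foreign address, state); the function name and the sample addresses are made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connections by state from `netstat -an` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: Proto, Local Address, Foreign Address, State
        # UDP rows have no state column and are skipped by the length check.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


# Hypothetical sample; on a real box, feed in the captured netstat output.
sample = """\
  TCP    10.0.0.5:49152    10.0.0.9:80       TIME_WAIT
  TCP    10.0.0.5:49153    10.0.0.9:80       TIME_WAIT
  TCP    10.0.0.5:3389     10.0.0.7:51000    ESTABLISHED
  UDP    0.0.0.0:123       *:*
"""

print(count_tcp_states(sample)["TIME_WAIT"])
```

Running this periodically and logging the TIME_WAIT figure shows whether the count ever drops once the polling process is stopped.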

Has anyone else seen this behavior, and if so, what was done to resolve it? We are aware of how to tune the TCP stack to shorten the timeout or allow more connections, but that is not the issue here.

Thanks!


Solution 1:

Amazon EC2 had a major problem with this. They recently fixed the bug. Maybe the same problem applies in your situation?

Hi, I am pasting below an explanation of what was causing this issue. The good news is that this has been fixed very recently by our engineering team. To get the fix, all you have to do is STOP/START the Windows Server 2008 instances where you are seeing this issue. Again, I am not talking about REBOOT, which is different. STOP/START causes the instance to move to a different (healthy) host. When these instances launch again, they will be running on hosts that have the fix in place, so they won't have this issue again. Below is the engineering explanation of this issue.

After an in-depth investigation, we've identified an issue when running Windows 2008 x64 on most available instance types which may result in TCP connections that remain in TIME_WAIT/CLOSE_WAIT for excessively long periods of time (in some cases, remaining in this state indefinitely). While in these states, the affected socket pairs remain unusable and, if enough accumulate, will result in port exhaustion for the ports in question. If this occurs, the only way to clear the affected socket pairs is to reboot the instance.

We have determined the cause to be the values produced by a timer function in the Windows 2008 kernel API which, on many of our 64-bit platforms, will occasionally retrieve a value that is extremely far in the future. This affects the TCP stack by causing the timestamps on the TCP socket pairs to be stamped significantly far in the future. According to Microsoft, there is a stored cumulative counter which will not be updated unless the value produced by this API call is larger than the cumulative value. The ultimate result is that sockets created after this point will all be stamped too far in the future until that future time is reached. In some cases, we have seen this value several hundred days into the future, so the socket pairs appear to be stuck forever.
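To make the mechanism concrete, here is a toy model of the behavior described in that explanation. It is only a sketch of the idea (a stored value that is updated only when a larger reading comes in), not the actual Windows kernel code; the class and function names are hypothetical, and the 120-second timeout is the figure quoted in the question.

```python
TIME_WAIT_SECONDS = 120  # TIME_WAIT timeout quoted in the question
DAY = 86400


class LatchedClock:
    """Toy clock whose stored value only ever moves forward."""

    def __init__(self) -> None:
        self.value = 0.0

    def read(self, raw_seconds: float) -> float:
        # The stored value is replaced only when the raw reading is larger.
        if raw_seconds > self.value:
            self.value = raw_seconds
        return self.value


def still_in_time_wait(clock: LatchedClock, closed_at: float, checked_at: float) -> bool:
    """Stamp a socket when it closes, then check it again at a later raw time."""
    expiry = clock.read(closed_at) + TIME_WAIT_SECONDS
    return clock.read(checked_at) < expiry


# Healthy host: socket closed at t=10s, checked again two days later.
healthy = LatchedClock()
healthy_stuck = still_in_time_wait(healthy, closed_at=10, checked_at=10 + 2 * DAY)

# Affected host: a single bogus reading 300 days ahead latches the clock first.
glitched = LatchedClock()
glitched.read(300 * DAY)
glitched_stuck = still_in_time_wait(glitched, closed_at=10, checked_at=10 + 2 * DAY)

print(healthy_stuck, glitched_stuck)  # False True
```

In the glitched case the socket's expiry lands 120 seconds past the bogus reading, and the clock does not advance again until real time catches up with it, which matches the "two days later and still in TIME_WAIT" symptom in the question.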

Solution 2:

There is a Microsoft article that describes a few ways to resolve this. It commonly comes from applications that are badly coded and do not close ports correctly. You need to look at which applications you have installed, or which tasks you are running, and disable them to see which one is causing the issue.

To fix the issue, you want to look at either of the following:

  1. Increase the upper range of ephemeral ports that are dynamically allocated to client TCP/IP socket connections.
  2. Reduce the client TCP/IP socket connection timeout value from the default value of 240 seconds (a more permanent fix).
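As a rough illustration of why both options help, the sketch below estimates how many new outbound connections per second a host can sustain before it runs out of ephemeral ports. It assumes every connection holds one ephemeral port for the full TIME_WAIT period, and it uses the default dynamic port range on Windows Server 2008 (49152 to 65535); the function name is made up for this example. Note that neither option helps if the connections never leave TIME_WAIT at all, as in the question.

```python
def max_sustained_rate(port_count: int, time_wait_seconds: float) -> float:
    """Upper bound on new connections per second before ports run out."""
    return port_count / time_wait_seconds


# Default dynamic (ephemeral) port range on Windows Server 2008.
DEFAULT_PORTS = 65535 - 49152 + 1

# Option 1: widen the port range (here, hypothetically, 1025 to 65535).
WIDENED_PORTS = 65535 - 1025 + 1

for ports in (DEFAULT_PORTS, WIDENED_PORTS):
    # Option 2: shorten the timeout (240s and 120s are the figures quoted
    # in this thread; 30s is an arbitrary shorter value for comparison).
    for timeout in (240, 120, 30):
        rate = max_sustained_rate(ports, timeout)
        print(f"{ports:>6} ports, {timeout:>3}s timeout: ~{rate:.0f} conn/s")
```

The default range works out to 16384 ports, which lines up with the "over 16k connections" ceiling reported in the question.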