TIME_WAIT connections not being cleaned up after timeout period expires

I am stress testing one of my servers by hitting it with a constant stream of new network connections. tcp_fin_timeout is set to 60, so if I send a constant stream of roughly 100 requests per second, I would expect to see a rolling average of 6000 (60 * 100) connections in the TIME_WAIT state. This is happening, but looking in netstat (using -o) to see the timers, I see connections like:

    TIME_WAIT   timewait (0.00/0/0)

where the timeout has expired but the connection is still hanging around, and eventually I run out of connections. Does anyone know why these connections don't get cleaned up? If I stop creating new connections they do eventually disappear, but while I am constantly creating new connections they don't; it seems like the kernel isn't getting a chance to clean them up. Are there some other config options I need to set to remove the connections as soon as they have expired?
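For reference, this is roughly how I'm watching the counts (a minimal sketch; exact flags may vary with your netstat version):

    # count sockets currently sitting in TIME_WAIT
    netstat -ton | grep -c TIME_WAIT

    # refresh the timer columns every second; entries stuck at
    # (0.00/0/0) are the expired-but-unreaped ones
    watch -n 1 'netstat -ton | grep TIME_WAIT'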

The server is running Ubuntu and my web server is nginx. It also has iptables with connection tracking; I'm not sure if that would cause these TIME_WAIT connections to live on.
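In case it's relevant, here is how the conntrack table usage can be checked (a sketch; the sysctl names vary a little between kernel versions):

    # how many connections netfilter is currently tracking, vs. the cap
    # (on older kernels these live under net.ipv4.netfilter.ip_conntrack_*)
    sysctl net.netfilter.nf_conntrack_count
    sysctl net.netfilter.nf_conntrack_max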

Thanks, Mark.


Solution 1:

This problem was interesting, as I've often wondered about it myself. I did a couple of tests and found some interesting results. If I opened one connection to a server and waited 60 seconds, it was invariably cleaned up (never got to 0.00/0/0). If I opened 100 connections, they too were cleaned up after 60 seconds. If I opened 101 connections, I would start to see connections in the state you mentioned (which I've also seen before). They appear to last roughly 120s, or 2×MSL (where MSL is 60 seconds), regardless of what fin_timeout is set to. I did some digging in the kernel source code and found what I believe is the 'reason'. There is some code that tries to limit the amount of socket reaping that happens per 'cycle'. The cycle frequency itself is set on a scale based on HZ:

linux-source-2.6.38/include/net/inet_timewait_sock.h:

    #define INET_TWDR_RECYCLE_SLOTS_LOG     5
    #define INET_TWDR_RECYCLE_SLOTS         (1 << INET_TWDR_RECYCLE_SLOTS_LOG)

    /*
     * If time > 4sec, it is "slow" path, no recycling is required,
     * so that we select tick to get range about 4 seconds.
     */
    #if HZ <= 16 || HZ > 4096
    # error Unsupported: HZ <= 16 or HZ > 4096
    #elif HZ <= 32
    # define INET_TWDR_RECYCLE_TICK (5 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #elif HZ <= 64
    # define INET_TWDR_RECYCLE_TICK (6 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #elif HZ <= 128
    # define INET_TWDR_RECYCLE_TICK (7 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #elif HZ <= 256
    # define INET_TWDR_RECYCLE_TICK (8 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #elif HZ <= 512
    # define INET_TWDR_RECYCLE_TICK (9 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #elif HZ <= 1024
    # define INET_TWDR_RECYCLE_TICK (10 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #elif HZ <= 2048
    # define INET_TWDR_RECYCLE_TICK (11 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #else
    # define INET_TWDR_RECYCLE_TICK (12 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
    #endif

    /* TIME_WAIT reaping mechanism. */
    #define INET_TWDR_TWKILL_SLOTS  8 /* Please keep this a power of 2. */
The per-run kill quota is also set there:
    #define INET_TWDR_TWKILL_QUOTA 100
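To make the recycle macros concrete, here's a worked example (my own arithmetic, assuming a kernel built with HZ=250, which is a common Ubuntu default):

    HZ = 250                       -> falls in the "HZ <= 256" branch
    INET_TWDR_RECYCLE_TICK = 8 + 2 - 5 = 5
    one recycle tick   = 2^5 = 32 jiffies = 32/250 s ≈ 128 ms
    full recycle range = 32 slots × 128 ms ≈ 4.1 s  (the "about 4 seconds" in the comment)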

In the actual timewait code you can see where it uses the quota to stop killing off TIME_WAIT connections if it has already done too many:

linux-source-2.6.38/net/ipv4/inet_timewait_sock.c:

    static int inet_twdr_do_twkill_work(struct inet_timewait_death_row *twdr,
                                        const int slot)
    {
    ...
                    if (killed > INET_TWDR_TWKILL_QUOTA) {
                            ret = 1;
                            break;
                    }

There's more information here on why HZ is set to what it is: http://kerneltrap.org/node/5411 But it isn't uncommon to increase it. I think, however, it's usually more common to enable tw_reuse/recycling to get around this bucket/quota mechanism (which seems confusing to me now that I've read about it; increasing HZ would be a much safer and cleaner solution). I posted this as an answer, but I think there could be more discussion here about the 'right way' to fix it. Thanks for the interesting question!
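If you want to check what HZ your own kernel was built with, the build config is usually shipped alongside the kernel on Ubuntu (path assumed from the stock packaging):

    # CONFIG_HZ is a compile-time option, not a runtime sysctl
    grep 'CONFIG_HZ' /boot/config-$(uname -r)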

Solution 2:

Instead of using tcp_tw_recycle = 1, use the following:

    tcp_tw_reuse = 1

tcp_tw_recycle is reported to be broken, and in some cases it does not work when you are using NAT or load balancing.
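Something like this applies it (standard sysctl usage; note that tcp_tw_reuse only affects connections this machine initiates outbound):

    # enable at runtime
    sysctl -w net.ipv4.tcp_tw_reuse=1

    # make it persistent across reboots
    echo 'net.ipv4.tcp_tw_reuse = 1' >> /etc/sysctl.conf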