How to NOT get so many apache CLOSE_WAIT connections?

netstat shows 153 connections in the CLOSE_WAIT state. These connections never get closed, so over time they accumulate, fill up RAM, and now the websites are not loading.

netstat shows many like the following:

tcp      160      0 my_server_name:http         my_server_name:51584        CLOSE_WAIT
tcp      160      0 my_server_name:http         my_server_name:51586        CLOSE_WAIT
tcp        0      0 my_server_name:http         my_server_name:50827        CLOSE_WAIT
tcp        0      0 my_server_name:http         my_server_name:50830        CLOSE_WAIT
tcp      312      0 my_server_ip.static.:http rate-limited-proxy-72:61249 CLOSE_WAIT
tcp      382      0 my_server_ip.static.:http b3090792.crawl.yahoo.:58663 CLOSE_WAIT
tcp      382      0 my_server_ip.static.:http b3090792.crawl.yahoo.:34655 CLOSE_WAIT
tcp      382      0 my_server_ip.static.:http b3090792.crawl.yahoo.:56681 CLOSE_WAIT
tcp      382      0 my_server_ip.static.:http b3090792.crawl.yahoo.:40829 CLOSE_WAIT
tcp      576      0 my_server_ip.static.:http b3090792.crawl.yahoo.:38278 CLOSE_WAIT
tcp       47      0 my_server_ip.static.:http 203.200.5.143.ill-bgl:49379 CLOSE_WAIT

If I look at the Apache error_log, lines like the following appear shortly before the CLOSE_WAIT situation develops:

[warn] child process 15670 still did not exit, sending a SIGTERM
[error] child process 15670 still did not exit, sending a SIGKILL
[notice] child pid 3511 exit signal Segmentation fault (11)

My setup: Apache 2.2.3, 1024 MB RAM (burst 2048 MB), CentOS release 5.3 (Final), running 2 WPMU 2.9.2 installations.


Background

A socket enters the CLOSE_WAIT state when the remote end terminates the connection by sending a packet with the FIN flag set. The socket then waits in this state for the local application to close() it; the close() sends our own FIN to the remote end and moves the socket to the LAST_ACK state. See also the TCP state transition diagram and RFC 793.

Note also that CLOSE_WAIT is unrelated to the infamous TIME_WAIT: the former occurs on the passive-close branch (the remote end closes first), while the latter occurs on the active-close branch (the local end closes first).
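The passive-close sequence above can be reproduced in a few lines. The sketch below (my own illustration, not from the original post; it assumes Linux, since it reads /proc/net/tcp, where state 08 is CLOSE_WAIT and 0A is LISTEN) lets the remote end close first and observes that the local socket sits in CLOSE_WAIT until we call close() ourselves:

```python
import socket
import time

def tcp_state(port, proc='/proc/net/tcp'):
    """Return the /proc/net/tcp state code of the non-listening socket
    bound locally to `port` ('08' means CLOSE_WAIT), or None."""
    with open(proc) as f:
        next(f)                                  # skip the header line
        for line in f:
            fields = line.split()
            local_port = int(fields[1].split(':')[1], 16)
            if local_port == port and fields[3] != '0A':  # 0A == LISTEN
                return fields[3]
    return None

# Listening server on an ephemeral loopback port.
srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket()
cli.connect(('127.0.0.1', port))
conn, _ = srv.accept()

cli.close()                  # the remote end sends its FIN ...
time.sleep(0.2)              # ... give the kernel a moment to process it
state_after_fin = tcp_state(port)
print(state_after_fin)       # '08' == CLOSE_WAIT: conn.close() not called yet

conn.close()                 # our close() sends a FIN -> LAST_ACK
srv.close()
```

The socket stays in CLOSE_WAIT indefinitely until the application closes it; there is no kernel timeout for this state.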

Problem description

Normally connections transition from CLOSE_WAIT to LAST_ACK fairly quickly. If the remote address and port keep changing fast, then a fair number of connections in the CLOSE_WAIT state may simply be the consequence of a very large number of connections being opened, used, and closed. System performance should still be examined, but in and of itself this does not constitute a problem.

If the remote address and port change slowly, the application processes are probably waiting for CPU; high load averages will confirm this.

If, on the other hand, the remote address and port stay constant and the number of connections in the CLOSE_WAIT state keeps growing, it most likely indicates a problem with the application. This is a special case of a resource-leak bug: the application leaks open sockets instead of closing them in a timely manner. This consumes kernel memory and will eventually make the application fail once it reaches the maximum number of open file descriptors.
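To tell these cases apart it helps to group CLOSE_WAIT sockets by remote address. A small diagnostic sketch (my own addition, assuming Linux and /proc/net/tcp; state code 08 is CLOSE_WAIT, and IPv4 addresses are stored as little-endian hex):

```python
from collections import Counter

def close_wait_by_remote(proc='/proc/net/tcp'):
    """Count CLOSE_WAIT sockets grouped by remote IPv4 address."""
    counts = Counter()
    with open(proc) as f:
        next(f)                                  # skip the header line
        for line in f:
            fields = line.split()
            if fields[3] == '08':                # 08 == CLOSE_WAIT
                ip_hex = fields[2].split(':')[0]
                # /proc/net/tcp stores IPv4 addresses little-endian,
                # so read the byte pairs in reverse order.
                octets = [str(int(ip_hex[i:i + 2], 16)) for i in (6, 4, 2, 0)]
                counts['.'.join(octets)] += 1
    return counts

print(close_wait_by_remote())
```

A handful of entries spread over many remotes suggests normal churn; a steadily growing count against the same remotes points at a socket leak in the application.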

Note, however, that the pace of the leak may be slow. Bugs like this often result from a failure to handle an exception in the middle of a request, which interrupts the execution flow in a worker thread and may subsequently prevent cleanup (including closing the socket). The offending exception may occur only rarely.
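The leak pattern looks roughly like this in any language: the close() sits after code that can raise, so an exception skips it. A minimal sketch (my own illustration; `process` is a hypothetical stand-in for request handling that raises mid-request):

```python
import socket

def process(conn):
    # Hypothetical request handler; assume it can fail mid-request.
    raise ValueError("malformed request")

def handle_buggy(conn):
    process(conn)          # an exception here skips the close() below
    conn.close()           # never reached -> socket leaks in CLOSE_WAIT

def handle_fixed(conn):
    try:
        process(conn)
    finally:
        conn.close()       # runs even when process() raises

# The buggy handler leaks the socket on error ...
a, b = socket.socketpair()
try:
    handle_buggy(a)
except ValueError:
    pass
print(a.fileno())          # still a valid descriptor: leaked

# ... the fixed handler closes it regardless.
c, d = socket.socketpair()
try:
    handle_fixed(c)
except ValueError:
    pass
print(c.fileno())          # -1: closed despite the exception
a.close(); b.close(); d.close()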

Temporary solution

A temporary solution is to increase the limit on open file descriptors and to restart the application periodically, when (preferably before) the problem starts to affect performance. Note that restarting may drop currently open connections. Redundant servers and load balancing can help hide the problem from the users.
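The descriptor limit is per process; for Apache it is typically raised via `ulimit -n` in the init script or /etc/security/limits.conf. The sketch below (my own addition, assuming Linux for the /proc/self/fd listing) shows how a process can inspect its own limits and current descriptor usage:

```python
import os
import resource

# Soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft=%d hard=%d' % (soft, hard))

# A process may raise its own soft limit up to the hard limit without root;
# raising the hard limit itself requires privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Count this process's currently open descriptors (Linux only).
open_fds = len(os.listdir('/proc/self/fd'))
print('open fds:', open_fds)
```

Watching the open-descriptor count against the soft limit tells you how long you have before the leak exhausts the process, and therefore how often the restarts need to happen.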

Permanent solution

The permanent solution is to deploy a version of the application without the bug. The degree to which the temporary solution harms users and the business, the readiness of the patched release, and the state of the last working release help decide whether to roll back to the last working version of the application or to wait for the fix.