select() hangs due to resource exhaustion - but what resource?

Connecting to my server via sftp sometimes results in a hang here:

if (select(max+1, rset, wset, NULL, NULL) < 0) {

This is line 1428 of OpenSSH 5.2p1's sftp-server.c (the main loop of sftp_server_main()).

The same hang occurs when opening a data connection over e.g. vanilla FTP. I am sometimes able to get through after a number of seconds or minutes, but sometimes the connection times out on the client side before the server is able to respond. When the server does respond and I am connected, then if I issue e.g. 'ls' it will hang again at the select() for some time.

ssh itself is fine: I can connect with no delay and issue commands.

I don't think it's socket exhaustion:

root@dl:~# cat /proc/net/sockstat
sockets: used 304
TCP: inuse 444 orphan 302 tw 152 alloc 451 mem 5280
UDP: inuse 4
RAW: inuse 0
FRAG: inuse 0 memory 0

root@dl:~# netstat -tan | awk '{print $6}' | sort | uniq -c
      2 CLOSE_WAIT
    121 CLOSING
      1 established)
    109 ESTABLISHED
     17 FIN_WAIT1
      9 FIN_WAIT2
      1 Foreign
    300 LAST_ACK
     20 LISTEN
      2 SYN_RECV
    433 TIME_WAIT

The machine also doesn't seem to be out of file descriptors, but I'm not 100% sure of that. And even if it were, wouldn't that produce an error rather than a hang?

It does seem to be somewhat related to the number of connections nginx is serving. I can shut down nginx and the problem goes away. Having said this, nginx and apache are able to coexist in this state with no problem (apache never hangs). People can also connect to an IRC server on the same machine with no problem during these "episodes". So maybe it is limited to select()?

What resource is nginx using that is not sockets/file descriptors that is causing select() to hang? I am pulling my hair out over this.

I've tried all of the usual network tuning (the various sysctl settings, reducing the timeouts), all with no effect. The machine is not out of RAM, and CPU and I/O are both fine.

Linux dl 2.6.26-2-486 #1 Sat Jun 11 14:47:34 UTC 2011 i686 GNU/Linux

It's running Debian Lenny.

What might cause select() to hang when checking some sockets?


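Two things can make select() appear to hang:

  1. A bug in the code calling select().

  2. No data has arrived yet on any of the descriptors being watched. With a NULL timeout, as in the call quoted in the question, select() blocks until at least one descriptor becomes ready.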
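To tell these two cases apart, it can help to call select() with a finite timeout so that "nothing has arrived" shows up as a return value of 0 instead of an indefinite block. This is a minimal sketch, not what sftp-server.c does; wait_readable() is a hypothetical helper name, and it watches a single descriptor for readability only.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_sec seconds for fd to become
 * readable. Returns 1 if readable, 0 on timeout, -1 on error (errno set). */
int wait_readable(int fd, int timeout_sec)
{
    /* select() cannot track descriptors numbered FD_SETSIZE or higher. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        fd_set rset;
        struct timeval tv;

        /* Both the set and the timeout may be modified by select(),
         * so they are rebuilt before every call. */
        FD_ZERO(&rset);
        FD_SET(fd, &rset);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        int n = select(fd + 1, &rset, NULL, NULL, &tv);
        if (n < 0 && errno == EINTR)
            continue;              /* interrupted by a signal: retry with a fresh timeout */
        if (n < 0)
            return -1;             /* real error, errno describes it */
        return n > 0 ? 1 : 0;      /* 0 means the timeout expired */
    }
}
```

A return of 0 means nothing arrived within the window, which points at the peer or the network path rather than at select() itself. A return of -1 points at the calling code or its descriptors.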