Debugging "clogged" TCP connections

I'm having trouble with an internet connection that seems to randomly "freeze" arbitrary TCP connections when they have not been used for a while. The connections stay established, but no data comes through.

When this happens, netstat still shows the connection status as ESTABLISHED on both the local computer:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0     53 192.168.0.10:41129      173.255.235.238:143     ESTABLISHED 8219/gnutls-cli  on (79.31/13/0)

...and the remote server:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0      0 173.255.235.238:143     68.5.174.98:41129       ESTABLISHED 5303/imapd       off (0.00/0/0)

However, it seems that no data at all is transferred. If I run strace on the local and remote process, both just show a repeating sequence of select calls (with different fds of course), e.g.

select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)

The internet connection overall does not seem affected; I can still establish new connections to the same service on the same server without any problems. However, the affected local applications seem to be unaware of the problem and just hang.

About 10 minutes after the attempted transmission on the local end, the connection on the remote end disappears from the netstat output (I wasn't able to catch any intermediate state), but it still stays ESTABLISHED on the local end.

Finally, after some more minutes, the local application aborts with a timeout and disappears from the local netstat output as well.

When I look at a packet capture of this connection on the client side, there is a long (expected) period of inactivity that seems to trigger the problem, then the local end tries to transmit some data again but never receives an ACK. Instead, 15 TCP Retransmissions go out, with intervals increasing from 0.3 seconds to 120 seconds. No activity is captured after that.
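To get a feel for how long the local end keeps retrying, the retransmission schedule can be approximated with a few lines of Python. This is only a rough sketch: the doubling of the interval and the hard cap are assumptions based on typical TCP exponential backoff, not something read from the capture, and `backoff_intervals` is a made-up helper name. Only the 0.3 second start, the 120 second maximum and the count of 15 come from the capture described above.

```python
def backoff_intervals(initial=0.3, cap=120.0, count=15):
    """Approximate retransmission intervals (assumed: doubling, capped)."""
    intervals = []
    delay = initial
    for _ in range(count):
        # Each wait is the doubled previous wait, but never above the cap.
        intervals.append(min(delay, cap))
        delay *= 2
    return intervals


if __name__ == "__main__":
    intervals = backoff_intervals()
    print(intervals)
    print(f"total: {sum(intervals) / 60:.1f} minutes")
```

Under these assumptions the retries add up to roughly a quarter of an hour, which would fit the local application only giving up some minutes after the remote end has already dropped the connection.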

Does anyone have a suggestion of how I could debug this further to find out where the problem lies and how to fix it?

Additionally, and/or as a temporary workaround: is there some way to globally reduce the timeout on the client and/or server, to shorten the time before the local application aborts?


Summarizing from the debian-user thread:

These symptoms are consistent with some NAT device sitting between client and server and dropping idle connections after 300 seconds.

There must be a NAT device somewhere in the chain, because the client's idea of its own IP address (192.168.0.10) differs from the address the server uses to send data to the client (68.5.174.98). Also, the 192.168.x.y range is reserved for private networks and is not routable on the public internet.
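The same comparison can be sketched in Python with the standard ipaddress module, using the two addresses from the netstat output above. The helper name `addresses_differ` is made up for illustration; in practice the two values would have to be read off manually on each side, as was done here.

```python
import ipaddress


def addresses_differ(client_view, server_view):
    """True if the client's own address is not the one the server sees."""
    return ipaddress.ip_address(client_view) != ipaddress.ip_address(server_view)


if __name__ == "__main__":
    local = "192.168.0.10"   # local address in the client's netstat output
    seen = "68.5.174.98"     # foreign address in the server's netstat output
    print("addresses differ:", addresses_differ(local, seen))
    # is_private is True for addresses in reserved private ranges.
    print("client address is private:", ipaddress.ip_address(local).is_private)
```

If both checks come out true, some device on the path is rewriting the source address.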

A workaround is to enable TCP keep-alive. Unfortunately this needs to be configured in every program separately (e.g. using the ServerAliveInterval option in ssh). However, under Linux the libkeepalive library can be used with LD_PRELOAD to activate the necessary socket option even for programs that normally don't support it.
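For programs you control yourself, keep-alive can be switched on directly on the socket. The following is a minimal sketch in Python, not taken from the thread: `enable_keepalive` is a made-up helper, and the timing values are illustrative assumptions chosen only so that the first probe goes out well before the 300 second idle limit. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are not available on every platform, so they are only set where the socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle=120, interval=30, count=4):
    """Turn on TCP keep-alive; timings are illustrative assumptions."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timing options below are platform-specific (e.g. Linux).
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("SO_KEEPALIVE:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

The options are set before connecting here, but they can also be applied to an already connected socket.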

For me, a better solution was to replace the responsible Cisco DPC3825 cable gateway with a NetGear CMD31T cable modem and a NetGear WGR614v9 gateway. The new gateway also does NAT, but does not have such a ridiculously short timeout.