SSH new connection begins to hang (not reject or terminate) after a day or so on Ubuntu 13.04 server

Recently we upgraded the server from 12.04 LTS server to 13.04. All was well, including after a reboot. With all packages updated we began to see a strange issue, ssh works for a day or so (unclear on timing) then a later request for SSH hangs (cannot ctrl+c, nothing).

It is up and serving webserver traffic etc.

Port 22 is open (ips etc altered slightly for posting):

nmap -T4 -A x.acme.com

Starting Nmap 6.40 ( http://nmap.org ) at 2013-09-12 16:01 CDT
Nmap scan report for x.acme.com (69.137.56.18)
Host is up (0.026s latency).
rDNS record for 69.137.56.18: c-69-137-56-18.hsd1.tn.provider.net
Not shown: 998 filtered ports
PORT   STATE SERVICE VERSION
22/tcp open  ssh     OpenSSH 6.1p1 Debian 4 (protocol 2.0)
| ssh-hostkey: 1024 54:d3:e3:38:44:f4:20:a4:e7:42:49:d0:a7:f1:3e:21 (DSA)
| 2048 dc:21:77:3b:f4:4e:74:d0:87:33:14:40:04:68:33:a6 (RSA)
|_256 45:69:10:79:5a:9f:0b:f0:66:15:39:87:b9:a1:37:f7 (ECDSA)
80/tcp open  http    Jetty 7.6.2.v20120308
| http-title: Log in as a Bamboo user - Atlassian Bamboo
|_Requested resource was http://x.acme.com/userlogin!default.action;jsessionid=19v135zn8cl1tgso28fse4d50?os_destination=%2Fstart.action
Service Info: OS: Linux; CPE: cpe:/o:linux:linux_kernel

Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 12.89 seconds

Here is the ssh -vvv:

ssh -vvv x.acme.com
OpenSSH_5.9p1, OpenSSL 0.9.8x 10 May 2012
debug1: Reading configuration data /Users/tfergeson/.ssh/config
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to x.acme.com [69.137.56.18] port 22.
debug1: Connection established.
debug3: Incorrect RSA1 identifier
debug3: Could not load "/Users/tfergeson/.ssh/id_rsa" as a RSA1 public key
debug1: identity file /Users/tfergeson/.ssh/id_rsa type 1
debug1: identity file /Users/tfergeson/.ssh/id_rsa-cert type -1
debug1: identity file /Users/tfergeson/.ssh/id_dsa type -1
debug1: identity file /Users/tfergeson/.ssh/id_dsa-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.1p1 Debian-4
debug1: match: OpenSSH_6.1p1 Debian-4 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9
debug2: fd 3 setting O_NONBLOCK
debug3: load_hostkeys: loading entries for host "x.acme.com" from file "/Users/tfergeson/.ssh/known_hosts"
debug3: load_hostkeys: found key type RSA in file /Users/tfergeson/.ssh/known_hosts:10
debug3: load_hostkeys: loaded 1 keys
debug3: order_hostkeyalgs: prefer hostkeyalgs: [email protected],[email protected],ssh-rsa
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: [email protected],[email protected],ssh-rsa,[email protected],[email protected],ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-256-96,hmac-sha2-512,hmac-sha2-512-96,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-256-96,hmac-sha2-512,hmac-sha2-512-96,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,[email protected],zlib
debug2: kex_parse_kexinit: none,[email protected],zlib
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: first_kex_follows 0 
debug2: kex_parse_kexinit: reserved 0 
debug2: kex_parse_kexinit: ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss,ecdsa-sha2-nistp256
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,[email protected]
debug2: kex_parse_kexinit: none,[email protected]
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: first_kex_follows 0 
debug2: kex_parse_kexinit: reserved 0 
debug2: mac_setup: found hmac-md5
debug1: kex: server->client aes128-ctr hmac-md5 none
debug2: mac_setup: found hmac-md5
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug2: dh_gen_key: priv key bits set: 130/256
debug2: bits set: 503/1024
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Server host key: RSA dc:21:77:3b:f4:4e:74:d0:87:33:14:40:04:68:33:a6
debug3: load_hostkeys: loading entries for host "x.acme.com" from file "/Users/tfergeson/.ssh/known_hosts"
debug3: load_hostkeys: found key type RSA in file /Users/tfergeson/.ssh/known_hosts:10
debug3: load_hostkeys: loaded 1 keys
debug3: load_hostkeys: loading entries for host "69.137.56.18" from file "/Users/tfergeson/.ssh/known_hosts"
debug3: load_hostkeys: found key type RSA in file /Users/tfergeson/.ssh/known_hosts:6
debug3: load_hostkeys: loaded 1 keys
debug1: Host 'x.acme.com' is known and matches the RSA host key.
debug1: Found key in /Users/tfergeson/.ssh/known_hosts:10
debug2: bits set: 493/1024
debug1: ssh_rsa_verify: signature correct
debug2: kex_derive_keys
debug2: set_newkeys: mode 1
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug2: set_newkeys: mode 0
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug2: key: /Users/tfergeson/.ssh/id_rsa (0x7ff189c1d7d0)
debug2: key: /Users/tfergeson/.ssh/id_dsa (0x0)
debug1: Authentications that can continue: publickey
debug3: start over, passed a different list publickey
debug3: preferred publickey,keyboard-interactive,password
debug3: authmethod_lookup publickey
debug3: remaining preferred: keyboard-interactive,password
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /Users/tfergeson/.ssh/id_rsa
debug3: send_pubkey_test
debug2: we sent a publickey packet, wait for reply
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug2: input_userauth_pk_ok: fp 3c:e5:29:6c:9d:27:d1:7d:e8:09:a2:e8:8e:6e:af:6f
debug3: sign_and_send_pubkey: RSA 3c:e5:29:6c:9d:27:d1:7d:e8:09:a2:e8:8e:6e:af:6f
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
Authenticated to x.acme.com ([69.137.56.18]:22).
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug1: Requesting [email protected]
debug1: Entering interactive session.
debug2: callback start
debug2: client_session2_setup: id 0
debug2: fd 3 setting TCP_NODELAY
debug2: channel 0: request pty-req confirm 1
debug1: Sending environment.
debug3: Ignored env ATLAS_OPTS
debug3: Ignored env rvm_bin_path
debug3: Ignored env TERM_PROGRAM
debug3: Ignored env GEM_HOME
debug3: Ignored env SHELL
debug3: Ignored env TERM
debug3: Ignored env CLICOLOR
debug3: Ignored env IRBRC
debug3: Ignored env TMPDIR
debug3: Ignored env Apple_PubSub_Socket_Render
debug3: Ignored env TERM_PROGRAM_VERSION
debug3: Ignored env MY_RUBY_HOME
debug3: Ignored env TERM_SESSION_ID
debug3: Ignored env USER
debug3: Ignored env COMMAND_MODE
debug3: Ignored env rvm_path
debug3: Ignored env COM_GOOGLE_CHROME_FRAMEWORK_SERVICE_PROCESS/USERS/tfergeson/LIBRARY/APPLICATION_SUPPORT/GOOGLE/CHROME_SOCKET
debug3: Ignored env JPDA_ADDRESS
debug3: Ignored env APDK_HOME
debug3: Ignored env SSH_AUTH_SOCK
debug3: Ignored env Apple_Ubiquity_Message
debug3: Ignored env __CF_USER_TEXT_ENCODING
debug3: Ignored env rvm_sticky_flag
debug3: Ignored env MAVEN_OPTS
debug3: Ignored env LSCOLORS
debug3: Ignored env rvm_prefix
debug3: Ignored env PATH
debug3: Ignored env PWD
debug3: Ignored env JAVA_HOME
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 0: request env confirm 0
debug3: Ignored env JPDA_TRANSPORT
debug3: Ignored env rvm_version
debug3: Ignored env M2_HOME
debug3: Ignored env HOME
debug3: Ignored env SHLVL
debug3: Ignored env rvm_ruby_string
debug3: Ignored env LOGNAME
debug3: Ignored env M2_REPO
debug3: Ignored env GEM_PATH
debug3: Ignored env AWS_RDS_HOME
debug3: Ignored env rvm_delete_flag
debug3: Ignored env EC2_PRIVATE_KEY
debug3: Ignored env RUBY_VERSION
debug3: Ignored env SECURITYSESSIONID
debug3: Ignored env EC2_CERT
debug3: Ignored env _
debug2: channel 0: request shell confirm 1
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768

I can hard reboot (only mac monitors at that location) and it will again be accessible. This now happens every single time. It is imperative that I get it sorted. The strange thing is that it behaves initially then starts to hang after several hours. I perused logs previously and nothing stood out.

From the auth.log, I can see that it has allowed me in, but still I get nothing back on the client side:

Sep 20 12:47:50 cbear sshd[25376]: Accepted publickey for tfergeson from 10.1.10.14 port 54631 ssh2
Sep 20 12:47:50 cbear sshd[25376]: pam_unix(sshd:session): session opened for user tfergeson by (uid=0)

UPDATES:

Still occurring even after setting UseDNS no and commenting out #session optional pam_mail.so standard noenv

This does not appear to be a network/dns related issue, as all services running on the machine are as responsive and accessible as ever, with the exception of sshd.

Any thoughts on where to start?


Solution 1:

From the SshAccess page on the GNU Savannah documentation wiki:

A problem can arise when you are trying to connect from behind a NAT router using OpenSSH. During session setup, after the password has been given, OpenSSH sets the TOS (type of service) field in the IP datagram. Some routers are known to choke on this. The effect is that your session hangs indefinitely after you gave your password. Here is the example output from such an ssh session:

user@localhost:~$ ssh -vvv {user-name}@cvs.savannah.gnu.org
OpenSSH_4.7p1 Debian-8ubuntu1.2, OpenSSL 0.9.8g 19 Oct 2007
debug1: Reading configuration data /etc/ssh/ssh_config
[...]
Enter passphrase for key '{homedir}/.ssh/id_rsa':
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
[...]
debug2: fd 5 setting TCP_NODELAY
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768

and from here on the session hangs.

The fix is to make ssh send all its traffic via netcat, because netcat won't set the TOS field. For this to work, you need to have netcat installed. You can test this by entering at the command line:

user@localhost:~$ which nc

and if you get a path back, like:

/bin/nc

then you probably have netcat installed. For the very cautious, you could also issue:

user@localhost:~$ nc -h

and look at the upcoming help text. If you don't have netcat, you can find it at http://netcat.sourceforge.net/. You may also want to try the packaging system which comes with your operating system distribution.

Once you found that you have netcat installed, issue the following command to test whether the netcat route solves your problem:

ssh -o "ProxyCommand nc %h %p" {user-name}@cvs.savannah.gnu.org

where {user-name} is your savannah login name. For a successfull login, you should get an output similar to this (with no hanging, i.e. you get a prompt afterwards):

user@localhost:~$ ssh -o "ProxyCommand nc %h %p" {user-name}@cvs.savannah.gnu.org
Enter passphrase for key '{home-dir}/.ssh/id_rsa': 
Last login: {datetime} from {ip-adr} 
You tried to execute: 
Sorry, you are not allowed to execute that command. 
Connection to cvs.savannah.gnu.org closed. 
user@localhost:~$

If you find that your login works via the netcat route, then you can make it permanent by adding a directive to the ssh config file ~/.ssh/config (or, if that file doesn't exist, create it):

ProxyCommand nc %h %p

Here's an example ssh config file in a user's home folder (/home/user/.ssh/config):

# This is the ssh client user configuration file.  See
# ssh_config(5) for more information.  This file provides defaults for
# this user, and the values can be changed on the command line.
#
# Configuration data is parsed as follows:
#  1. command line options
#  2. user-specific file
#  3. system-wide file
# Any configuration value is only changed the first time it is set.
# Thus, host-specific definitions should be at the beginning of the
# configuration file, and defaults at the end.
#
# Directive to overcome TOS issue with our NAT router. During session setup,
# OpenSSH sets the TOS (type of service) field after the user has submitted
# the password. Some routers are known to choke on this, with the result
# that the session hangs during buildup. As workaround we send our traffic
# via netcat which doesn't set the TOS field.
ProxyCommand nc %h %p

It's advisable to put the comments as well, otherwise six months later you may find yourself wondering what that directive is all about??

You could also add this directive to your global ssh config file (/etc/ssh/ssh_config), but this change would be system wide, and not all users on your system may appreciate that change.

Solution 2:

ssh -o IPQoS=0 x.acme.com should tell openssh to stop setting the QoS field that's being choked on.

Solution 3:

As ridiculous as this sounds, the only workaround I have at this time is to schedule a nightly reboot. Luckily, this workaround is acceptable only because it is a development machine, had it been a production machine I would be in trouble.

I hate this, but wanted to be sure that others who find this thread know I have no solution. Added this to the root crontab for a nightly reboot at 4 am:

0 4 * * * /sbin/shutdown -r +5

Solution 4:

I had the problem behind a NAT with ssh. I tried "mosh" with ssh options and everything works just perfect. Install mosh