Group of workstations refuse to SSH to one another, but can connect just fine in and out to others

I have a group of Ubuntu 14.04 workstations that are all clones deployed via image. After imaging, each one is assigned a static IP and deletes and regenerates its host keys. Each of them seems fine on its own: they can connect to the file server, computing cluster, etc. However, they can't connect to each other via SSH. The connection just hangs (until it times out after ~10 minutes) on debug1: SSH2_MSG_KEXINIT sent.

I can connect to them all just fine from a Windows laptop, an older 12.04 workstation, or a 12.04 server, or from one problem workstation to another by hopping through anything else, but not like-to-like.


Solution 1:

I was able to reproduce the symptom by introducing a PMTU black hole in my network connectivity.

On Ubuntu 12.04 the key exchange init message is about 1 KB in size, which is well within the typical network MTU (1500 bytes on Ethernet). On Ubuntu 14.04 the key exchange init message has grown to almost 2 KB, making it the first message during the connection to exceed a typical network MTU.

This means the symptom of an MTU issue has changed. Earlier you may have been able to connect, but transferring a large file over the connection, or running a command which produces lots of output quickly, could stall the connection. Now an MTU issue causes the connection to stall before authentication.

Reducing the MTU on the network interface is a usable workaround until you identify the root cause. It can also be used to confirm that the problem is indeed MTU related. If the network interface is called eth0, you could try this command (as root):

ifconfig eth0 mtu 1280
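
Another way to check for an MTU problem, without changing any interface settings, is to send pings that are not allowed to be fragmented. The following is only a rough sketch, assuming IPv4 and the Linux iputils version of ping (its -M do option sets the Don't Fragment bit). The function names and the host name in the usage note are made up for illustration.

```shell
#!/bin/sh

# ICMP echo payload size that produces an IPv4 packet of exactly the
# given MTU: 20 bytes of IPv4 header (no options) plus 8 bytes of
# ICMP header are subtracted from the MTU.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three pings with the Don't Fragment bit set, sized to fill the
# given MTU. Assumes iputils ping; host and MTU are supplied by the caller.
probe_mtu() {
    host=$1
    mtu=$2
    ping -M do -c 3 -W 2 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

For example, with a hypothetical peer called otherworkstation, you could run probe_mtu otherworkstation 1280 followed by probe_mtu otherworkstation 1500. If the smaller probe gets replies while the larger one times out silently, with no "fragmentation needed" error coming back, that would be consistent with a PMTU black hole somewhere on the path.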