Mysterious “fragmentation required” rejections from gateway VM

Solution 1:

You can't drop ICMP “fragmentation required” messages. They're essential to Path MTU discovery, which TCP in turn needs in order to work properly. Please LART the firewall administrator.

By the transparency rule, a packet-filtering router acting as a firewall which permits outgoing IP packets with the Don't Fragment (DF) bit set MUST NOT block incoming ICMP Destination Unreachable / Fragmentation Needed errors sent in response to the outbound packets from reaching hosts inside the firewall, as this would break the standards-compliant usage of Path MTU discovery by hosts generating legitimate traffic. -- Firewall Requirements - RFC2979 (emphasis in original)

This is a configuration that has been recognized as fundamentally broken for more than a decade. ICMP is not optional.
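
For the curious, here is a minimal sketch of the sending-host side of Path MTU discovery on Linux: force the DF bit on a socket, try to send something larger than the path allows, and read back the MTU the kernel has learned. The destination address and payload size are purely illustrative, and the numeric option values are spelled out because not every Python build exposes them on the socket module.

    import errno
    import socket

    # Linux values for the relevant IP-level socket options.
    IP_MTU_DISCOVER = 10   # per-socket PMTU discovery behaviour
    IP_PMTUDISC_DO = 2     # always set DF; never fragment locally
    IP_MTU = 14            # read back the path MTU currently cached for this socket

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("192.0.2.1", 9))        # illustrative destination (TEST-NET-1, discard port)

    try:
        s.send(b"x" * 65000)           # larger than any plausible path MTU
    except OSError as e:
        if e.errno == errno.EMSGSIZE:
            # The kernel already knows a smaller MTU for this path - learned either
            # from the interface itself or from returning "fragmentation needed" errors.
            print("path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))

If a firewall silently eats those errors instead, the sender never learns the smaller MTU, keeps emitting packets that can't be forwarded, and connections simply hang: the classic PMTUD black hole.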

Solution 2:

I finally got to the bottom of this. It turned out to be an issue with VMware's implementation of TCP segmentation offloading in the virtual NIC of the target server.

The server's TCP/IP stack would send one large block along to the NIC, with the expectation that the NIC would break this into TCP segments restricted to the link's MTU. However, VMware decided to leave this in one large segment until - well, I'm not sure when.

It seems the data actually stayed in one large segment all the way to the gateway VM's TCP/IP stack, which is what elicited the rejection.
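
For illustration, this is roughly the job the virtual NIC is being trusted to do. A minimal sketch, assuming the standard Ethernet MTU of 1500 bytes and no IP or TCP options (none of these numbers come from the capture itself):

    # What TCP segmentation offload delegates to the NIC: the stack hands down one
    # oversized buffer plus the MSS, and the NIC is expected to emit wire-sized segments.
    IP_HEADER = 20                       # bytes, no IP options
    TCP_HEADER = 20                      # bytes, no TCP options
    MTU = 1500                           # standard Ethernet MTU
    MSS = MTU - IP_HEADER - TCP_HEADER   # 1460 bytes of TCP payload per segment

    def segment(payload, mss=MSS):
        """Split one large offloaded buffer into MSS-sized TCP payloads."""
        return [payload[i:i + mss] for i in range(0, len(payload), mss)]

    # Two full segments' worth of data handed down as a single buffer:
    print([len(p) for p in segment(b"x" * (2 * MSS))])   # [1460, 1460]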

An important clue was buried in the resulting ICMP packet: the copy of the rejected packet's IP header that it carried indicated a total length of 2960 bytes - way larger than the actual packet it appeared to be rejecting. This is also exactly the size a TCP segment would be on the wire if it had combined the data from both of the segments sent thus far.
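
For reference, a Fragmentation Needed error quotes the IP header (plus at least the first eight bytes of payload) of the packet it is rejecting, which is where that figure comes from; with a 1500-byte MTU and 1460-byte MSS, two full segments coalesced into one would total 2 × 1460 + 20 + 20 = 2960 bytes on the wire. Here is a minimal sketch of pulling those fields out of such an error, assuming the buffer starts at the ICMP header (the function name and the fabricated test bytes are mine, not from the capture):

    import struct

    def parse_frag_needed(icmp):
        """Extract the next-hop MTU and the quoted packet's Total Length from an
        ICMP type 3 / code 4 (fragmentation needed) error."""
        icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack("!BBHHH", icmp[:8])
        if icmp_type != 3 or code != 4:
            raise ValueError("not a fragmentation-needed error")
        embedded_ip = icmp[8:]                                    # copy of the rejected packet's IP header
        total_length = struct.unpack("!H", embedded_ip[2:4])[0]   # Total Length field of that header
        return next_hop_mtu, total_length

    # Fabricated bytes purely to exercise the parser: next-hop MTU 1500 and an
    # embedded header claiming a 2960-byte original datagram.
    fake = struct.pack("!BBHHH", 3, 4, 0, 0, 1500) + struct.pack("!BBH", 0x45, 0, 2960) + b"\x00" * 24
    print(parse_frag_needed(fake))   # (1500, 2960)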

One thing that made the issue very hard to diagnose was that the transmitted data actually was split into 1500-byte frames as far as Wireshark, running on another VM (connected to the same vSwitch on a separate, promiscuous port group), could see. I'm really not sure why the gateway VM saw one packet while the Wireshark VM saw two. FWIW, the gateway doesn't have large receive offload enabled - I could understand if it did. The Wireshark VM is running Windows 7.

I think VMware's rationale for delaying the segmentation is that, if the data ends up going out a physical NIC, the NIC's actual hardware offload can be leveraged. It does seem buggy, however, that it would fail to segment before delivering into another VM - and inconsistently, at that. I've seen this behaviour mentioned elsewhere as a VMware bug.

The solution was simply to turn off TCP segmentation offloading on the target server. The procedure varies by OS, but FWIW:

In Windows, open the connection's properties (General or Networking tab), click "Configure..." beside the adapter, and look on the Advanced tab. For Server 2003 R2 the setting is given as "IPv4 TCP Segmentation Offload"; for Server 2008 R2 it's "Large Send Offload (IPv4)."
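
On a Linux guest, the usual equivalent is ethtool. A minimal sketch, assuming the interface is named eth0 and that losing the offload is acceptable in your environment:

    # Inspect and disable TCP segmentation offload on a Linux guest via ethtool.
    # The interface name is an assumption; adjust as needed.
    import subprocess

    IFACE = "eth0"   # hypothetical interface name

    subprocess.run(["ethtool", "-k", IFACE], check=True)                 # show current offload settings
    subprocess.run(["ethtool", "-K", IFACE, "tso", "off"], check=True)   # disable TCP segmentation offload
    subprocess.run(["ethtool", "-K", IFACE, "gso", "off"], check=True)   # optionally disable generic segmentation offload too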

This solution is a bit of a kludge and could conceivably impact performance in some environments, so I'll still accept any better answer.

Solution 3:

I had the same symptoms and the problem turned out to be this kernel bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754294