This is a bug in the Linux kernel's IPsec implementation. When deciding whether to fragment an outgoing packet, the kernel fails to account for the overhead added by transport-mode ESP encapsulation; the encapsulated packet then exceeds the interface MTU and is dropped on output. I don't know whether this has been fixed in newer kernels.
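To make the size problem concrete, here is a rough sketch of the arithmetic, not the kernel's actual logic. It assumes IPv4 with a 20-byte header, a 1500-byte MTU, and an ESP transform with a 16-byte IV, a 16-byte cipher block, and a 12-byte ICV (for example AES-CBC with HMAC-SHA1-96); your SA may use different values. The function names are made up for illustration.

```python
def esp_transport_overhead(payload_len, iv_len=16, block_size=16, icv_len=12):
    """Bytes that transport-mode ESP adds around an IP payload.

    Defaults are assumptions (AES-CBC + HMAC-SHA1-96), not values
    taken from any particular kernel or SA configuration.
    """
    header = 4 + 4 + iv_len  # SPI + sequence number + IV
    # Payload plus the 2 trailer bytes (pad length, next header)
    # is padded up to a multiple of the cipher block size.
    pad = (-(payload_len + 2)) % block_size
    trailer = pad + 2 + icv_len
    return header + trailer


def fits_mtu(ip_packet_len, mtu=1500, ip_header_len=20):
    """Return (fits, size_after_esp) for a packet of ip_packet_len bytes."""
    payload_len = ip_packet_len - ip_header_len
    total = ip_packet_len + esp_transport_overhead(payload_len)
    return total <= mtu, total


if __name__ == "__main__":
    # A packet that fits the MTU before ESP no longer fits afterwards.
    print(fits_mtu(1500))
    print(fits_mtu(1400))
```

Under these assumptions, a packet that is exactly MTU-sized before encapsulation grows by a few dozen bytes, so a fragmentation decision made on the pre-ESP size lets an oversized packet through to the interface.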