VPN server not responding, no indications of blocked traffic

I have a Mac OS X Server (10.9) machine running a L2TP/IPSec VPN server. Configuration appears to be fine, and the hostname of both the server and the VPN is set to a DynDNS hostname. The server is configured behind an AirPort Extreme router with port forwarding, connected to an Arris modem/router with the router disabled (bridged). The server is configured with a static internal IP and the router also has its MAC address tied via DHCP to guarantee the internal address is consistent.

If I enter the server's internal IP address (10.0.1.x) and try to connect to the VPN server from within the network, everything works fine. But if I enter the external hostname (the DynDNS name) and try to connect, again from inside the network, it fails to connect. It similarly fails to connect when outside the network (via LTE for example).

Other services (SSH, remote desktop, etc.) all connect fine from both within and outside the network. Only VPN is affected. I can confirm the server is reachable from SSH and Remote Desktop (ports 22/5900).

I have further confirmed that the router is forwarding ports 500(UDP), 1701(UDP) and 4500(UDP) in addition to the other ports used by the other services.

When I try to connect, the following appears on the client's console:

12/16/13 11:13:33.213 PM configd[28]: SCNC: start, triggered by (15822) com.apple.prefe, type L2TP, status 0, trafficClass 0
12/16/13 11:13:33.229 PM pppd[15967]: publish_entry SCDSet() failed: Success!
12/16/13 11:13:33.230 PM pppd[15967]: publish_entry SCDSet() failed: Success!
12/16/13 11:13:33.230 PM pppd[15967]: pppd 2.4.2 (Apple version 727.1.15) started by user, uid 501
12/16/13 11:13:33.231 PM pppd[15967]: L2TP connecting to server 'x.x.x.x' (x.x.x.x)...
12/16/13 11:13:33.232 PM pppd[15967]: IPSec connection started
12/16/13 11:13:33.244 PM racoon[15968]: accepted connection on vpn control socket.
12/16/13 11:13:33.244 PM racoon[15968]: Connecting.
12/16/13 11:13:33.244 PM racoon[15968]: IPSec Phase 1 started (Initiated by me).
12/16/13 11:13:33.245 PM racoon[15968]: IKE Packet: transmit success. (Initiator, Main-Mode message 1).
12/16/13 11:13:33.245 PM racoon[15968]: >>>>> phase change status = Phase 1 started by us
12/16/13 11:13:33.416 PM racoon[15968]: >>>>> phase change status = Phase 1 started by peer
12/16/13 11:13:33.416 PM racoon[15968]: IKE Packet: receive success. (Initiator, Main-Mode message 2).
12/16/13 11:13:33.420 PM racoon[15968]: IKE Packet: transmit success. (Initiator, Main-Mode message 3).
12/16/13 11:13:33.429 PM racoon[15968]: IKE Packet: receive success. (Initiator, Main-Mode message 4).
12/16/13 11:13:33.447 PM racoon[15968]: IKE Packet: transmit success. (Initiator, Main-Mode message 5).
12/16/13 11:13:36.715 PM racoon[15968]: !!! skipped retransmitting frags: frag_flags 1, r->sendbuf->l 112, max 1280
12/16/13 11:13:36.715 PM racoon[15968]: Received retransmitted packet from x.x.x.x[500].
12/16/13 11:13:36.715 PM racoon[15968]: the packet is retransmitted by x.x.x.x[500].
12/16/13 11:13:36.745 PM racoon[15968]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:39.872 PM racoon[15968]: !!! skipped retransmitting frags: frag_flags 1, r->sendbuf->l 112, max 1280
12/16/13 11:13:39.872 PM racoon[15968]: Received retransmitted packet from x.x.x.x[500].
12/16/13 11:13:39.873 PM racoon[15968]: the packet is retransmitted by x.x.x.x[500].
12/16/13 11:13:40.043 PM racoon[15968]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:43.170 PM racoon[15968]: !!! skipped retransmitting frags: frag_flags 1, r->sendbuf->l 112, max 1280
12/16/13 11:13:43.170 PM racoon[15968]: Received retransmitted packet from x.x.x.x[500].
12/16/13 11:13:43.170 PM racoon[15968]: the packet is retransmitted by x.x.x.x[500].
12/16/13 11:13:43.335 PM racoon[15968]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:55.912 PM racoon[15968]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:56.367 PM racoon[15968]: !!! skipped retransmitting frags: frag_flags 1, r->sendbuf->l 112, max 1280
12/16/13 11:13:56.367 PM racoon[15968]: Received retransmitted packet from x.x.x.x[500].
12/16/13 11:13:56.367 PM racoon[15968]: the packet is retransmitted by x.x.x.x[500].
12/16/13 11:14:03.416 PM pppd[15967]: IPSec connection failed
12/16/13 11:14:03.416 PM racoon[15968]: IPSec disconnecting from server x.x.x.x
12/16/13 11:14:03.416 PM racoon[15968]: glob found no matches for path "/var/run/racoon/*.conf"

And this on the server's console:

12/16/13 11:13:33.404 PM racoon[216]: IPSec Phase 1 started (Initiated by peer).
12/16/13 11:13:33.404 PM racoon[216]: IKE Packet: receive success. (Responder, Main-Mode message 1).
12/16/13 11:13:33.404 PM racoon[216]: >>>>> phase change status = Phase 1 started by us
12/16/13 11:13:33.404 PM racoon[216]: IKE Packet: transmit success. (Responder, Main-Mode message 2).
12/16/13 11:13:33.541 PM racoon[216]: IKE Packet: receive success. (Responder, Main-Mode message 3).
12/16/13 11:13:33.559 PM racoon[216]: IKE Packet: transmit success. (Responder, Main-Mode message 4).
12/16/13 11:13:33.566 PM racoon[216]: Connecting.
12/16/13 11:13:36.697 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:36.697 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:39.989 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:43.286 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:13:56.484 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:14:06.392 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:14:12.978 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:14:32.767 PM racoon[216]: IKE Packet: transmit success. (Phase 1 Retransmit).
12/16/13 11:14:39.390 PM racoon[216]: IKEv1 Phase 1: maximum retransmits. (Phase 1 Maximum Retransmits).
12/16/13 11:14:39.390 PM racoon[216]: Phase 1 negotiation failed due to time up. 45b24df5cc9713e7:9b427f72231ccb59

One thing I notice is that the client failed at 11:14:03 while the server kept retransmitting packets for another 30 seconds, all the way until it timed out. The client in this case is Mac OS X, but iOS clients behave similarly.

What are the troubleshooting steps I should look for here?


Solution 1:

Okay, turns out it's a "bug" in the latest version of Mac OS X Server. From what I was able to find, he IKE daemon racoon won't accept connections if the source port is not UDP 4500. Most connections that go through NAT will randomize the source port, which means it won't connect. The old version of the daemon doesn't have this restriction. Connections from inside the network won't randomize the port if connecting directly to the server's IP, but loopback and external connections apparently will, leading to the failures.

The quick solution, then, was to replace the racoon binary with an old version from OS 10.8, of course backing up the old one by naming it racoon.old (or would racoon.new be more correct? :D).

Apple do appear to be aware of the issue and hopefully they'll issue a fix; in the meantime, reverting the binary works.