What might prevent IKE handshake success in building an IPSEC tunnel?

We use Cisco ASA for our IPSEC VPNs, using the EZVPN method. From time to time we encounter problems where an ISP has made a change to their network and our VPN stops working. Nine times out of ten the ISP denies that their change could have stopped this working - I suspect because they don't understand exactly what might have caused the problem. Rather than just bashing heads with them I want to try and point them in a direction that might get a speedier resolution.

In my current incident, I can ssh onto the external interface of the ASA and do a little poking around:

 sh crypto isakmp sa

   Active SA: 1
    Rekey SA: 0 (A tunnel will report 1 Active and 1 Rekey SA during rekey)
Total IKE SA: 1

1   IKE Peer: {Public IP address of London ASA}
    Type    : user            Role    : initiator
    Rekey   : no              State   : AM_TM_INIT_XAUTH_V6C

At the other end of the link I see the following:

Active SA: 26
<snip>
25  IKE Peer: {public IP address of Port-Au-Prince-ASA}
    Type    : user            Role    : responder
    Rekey   : no              State   : AM_TM_INIT_MODECFG_V6H

I can't find any documentation for what AM_TM_INIT_XAUTH_V6C or AM_TM_INIT_MODECFG_V6H mean, but I'm pretty sure they indicate that the IKE handshake has failed for some reason.

Can anyone suggest any likely things that might be preventing IKE from succeeding, or specific details of what AM_TM_INIT_XAUTH_V6C means?

Update: We connected the ASA at the site of a customer of another ISP. The VPN connection came up immediately. This confirms that the problem is not configuration related. The ISP is now accepting responsibility and investigating further.

Update: The connection suddenly came back online last week. I have notified the ISP to see if they changed anything, but not heard back yet. Frustratingly I am now seeing a similar issue on another site. I found a Cisco doc on the effects of fragmentation on VPN. I am starting to think that this may be the cause of the issues I am seeing.


Solution 1:

With a little assistance from Cisco I did some deeper analysis of what was happening, and figured out the things that I needed to be checking for. The useful things that Cisco told me:

  • debug crypto isakmp 5 gives enough detail to see whether problems are occurring with ISAKMP traffic
  • clear crypto isakmp sa clears out any stale security associations.
  • clear crypto isakmp {client_ip_address} can be used on the HQ to clear out a specific security association (you don't necessarily want to clear all your security associations if it is only one device that is having trouble!)
  • packet captures at both ends are really useful to figure out what is going on

Reading up a little on the IPSEC suite, and ISAKMP more specifically, showed that the following need to be allowed through any firewalls in the path:

  • ISAKMP traffic on UDP port 500
  • ISAKMP (used for NAT Traversal, NAT-T) traffic on UDP port 4500
  • ESP traffic (IP Protocol 50)
  • AH traffic (IP Protocol 51)

It seems a lot of people out there don't realise the important difference between IP protocols and TCP/UDP ports.
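To make that distinction concrete, here is a minimal Python sketch. It is only an illustration, not anything Cisco provided, and the function name classify_ipsec_packet is hypothetical. It looks at the IP protocol number first and only considers a port when the protocol is UDP. ESP and AH sit directly on top of IP and have no ports at all, so a rule written as "allow port 50" never matches ESP.

```python
# IP protocol numbers (the "Protocol" field of the IPv4 header).
IPPROTO_UDP = 17
IPPROTO_ESP = 50
IPPROTO_AH = 51


def classify_ipsec_packet(ip_protocol, dst_port=None):
    """Return a label for IPsec-related traffic, or None if unrelated.

    dst_port is only meaningful when ip_protocol is UDP (17);
    ESP and AH carry no port numbers.
    """
    if ip_protocol == IPPROTO_ESP:
        return "ESP"
    if ip_protocol == IPPROTO_AH:
        return "AH"
    if ip_protocol == IPPROTO_UDP:
        if dst_port == 500:
            return "ISAKMP"
        if dst_port == 4500:
            return "NAT-T"
    return None
```

For example, classify_ipsec_packet(17, 50) returns None: UDP port 50 is not ESP, which is the mistake firewall rules often make.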

The following packet captures focussed on the above types of traffic. These were set up on both the remote and HQ ASAs:

object service isakmp-nat-t 
    service udp destination eq 4500 
    description 4500
object-group service ISAKMP-Services
    description Traffic required for ISAKMP
    service-object esp 
    service-object ah 
    service-object object isakmp-nat-t 
    service-object udp destination eq isakmp
access-list ISAKMP extended permit object-group ISAKMP-Services host {hq_ip_address} host {remote_ip_address}
access-list ISAKMP extended permit object-group ISAKMP-Services host {remote_ip_address} host {hq_ip_address}
capture ISAKMP access-list ISAKMP interface outside

You can then download the captures from each device at https://{device_ip_address}/capture/ISAKMP/pcap and analyse them in Wireshark.

My packet captures showed that the ISAKMP traffic outlined above was getting fragmented. Since those packets are encrypted, once they are fragmented it is hard to put them back together and things break.
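If you would rather script the check than eyeball it in Wireshark, the following is a rough Python sketch that lists the fragmented IPv4 packets in a downloaded capture. It is my own illustration and makes several assumptions: the file is in the classic pcap format with microsecond timestamps (not pcapng), the link type is Ethernet with no VLAN tags, and the function name find_ipv4_fragments is hypothetical.

```python
import struct


def find_ipv4_fragments(pcap_bytes):
    """Return (packet_number, ip_id, fragment_offset_bytes, more_fragments)
    for every fragmented IPv4 packet. Packet numbers start at 1."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"   # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"   # file written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file")

    fragments = []
    offset = 24        # skip the 24-byte global header
    number = 0
    while offset + 16 <= len(pcap_bytes):
        # Per-packet record header: ts_sec, ts_usec, incl_len, orig_len.
        _, _, incl_len, _ = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16])
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        number += 1

        # Need 14 bytes of Ethernet plus a 20-byte minimum IPv4 header.
        if len(frame) < 34:
            continue
        ethertype = struct.unpack("!H", frame[12:14])[0]
        if ethertype != 0x0800:    # not IPv4
            continue

        # IPv4 Identification and Flags/Fragment Offset fields.
        ip_id, flags_frag = struct.unpack("!HH", frame[18:22])
        more_fragments = bool(flags_frag & 0x2000)
        frag_offset = (flags_frag & 0x1FFF) * 8   # field counts 8-byte units
        if more_fragments or frag_offset > 0:
            fragments.append((number, ip_id, frag_offset, more_fragments))
    return fragments
```

Reading the saved capture with open(path, "rb").read() and passing the bytes in gives a list you can hand to the ISP. Fragments of the same original packet share the same IP ID, so several entries with one ID and increasing offsets are the pieces of a single ISAKMP packet.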

Giving this information to the ISP meant they could do their own focussed checking, and resulted in them making some changes to a firewall. Turns out the ISP was blocking all ICMP traffic on their edge router, which meant that Path MTU Discovery was broken, resulting in fragmented ISAKMP packets. Once they stopped blanket blocking ICMP the VPN came up (and I expect all their customers started getting better service in general).