Inter-cloud IPSec VPN between AWS and Google Cloud times out and reconnects intermittently
The Dead Peer Detection (DPD) check on my AWS - Google Cloud IPSec VPN tunnel keeps timing out, forcing the tunnel to re-establish the connection at least once a day. Production services on Google Cloud are down for about 4 minutes each time while the connection re-establishes.
I've tried running a bash script on a Google Cloud VM that pings AWS every 5 minutes, but the timeouts keep happening. I guess ICMP isn't considered interesting traffic. Is there any other way I can troubleshoot this or force the connection to stay open?
Rationale: I currently have an RDS PostgreSQL database and a Redis server running in AWS that some services on Google Cloud need to access. Rather than exposing them publicly, especially since Redis security is pretty much network isolation, I opted to open an IPSec tunnel so they could communicate via internal IPs. I followed the instructions at https://medium.com/google-cloud/vpn-between-two-clouds-e2e3578be773 and got the VPN working. The setup uses static routing, IPSec with IKEv1, and the Generic VPN configuration for AWS.
Looking at the VPN logs from the Google Cloud end (I don't see a way to see this on AWS) - this is what it looks like most of the time.
D sending DPD request
D generating INFORMATIONAL_V1 request 1644989539 [ HASH N(DPD) ]
D sending packet: from XX.XX.XX.XX[4500] to XX.XX.XX.XX[4500] (92 bytes)
D received packet: from XX.XX.XX.XX[4500] to XX.XX.XX.XX[4500] (92 bytes)
D parsed INFORMATIONAL_V1 request 4120768163 [ HASH N(DPD_ACK) ]
D sending DPD request
When it goes down, it looks like the DPD check fails and it tries to re-establish the connection. I've pasted the full log below in case it is useful.
D DPD check timed out, enforcing DPD action
D creating acquire job for policy with reqid {1}
I initiating Main Mode IKE_SA vpn_BB.BB.BB.BB[58] to BB.BB.BB.BB
D generating ID_PROT request 0 [ SA V V V V ]
D sending packet: from AA.AA.AA.AA[500] to BB.BB.BB.BB[500] (156 bytes)
D received packet: from BB.BB.BB.BB[500] to AA.AA.AA.AA[500] (124 bytes)
D parsed ID_PROT response 0 [ SA V V ]
D received DPD vendor ID
D received NAT-T (RFC 3947) vendor ID
D generating ID_PROT request 0 [ KE No NAT-D NAT-D ]
D sending packet: from AA.AA.AA.AA[500] to BB.BB.BB.BB[500] (244 bytes)
D received packet: from BB.BB.BB.BB[500] to AA.AA.AA.AA[500] (228 bytes)
D parsed ID_PROT response 0 [ KE No NAT-D NAT-D ]
D remote host is behind NAT
D generating ID_PROT request 0 [ ID HASH N(INITIAL_CONTACT) ]
D sending packet: from AA.AA.AA.AA[4500] to BB.BB.BB.BB[4500] (108 bytes)
D received packet: from BB.BB.BB.BB[4500] to AA.AA.AA.AA[4500] (76 bytes)
D parsed ID_PROT response 0 [ ID HASH ]
I IKE_SA vpn_BB.BB.BB.BB[58] established between AA.AA.AA.AA[AA.AA.AA.AA]...BB.BB.BB.BB[BB.BB.BB.BB]
D scheduling rekeying in 35670s
D maximum IKE_SA lifetime 36270s
D generating QUICK_MODE request 3920627352 [ HASH SA No KE ID ID ]
D sending packet: from AA.AA.AA.AA[4500] to BB.BB.BB.BB[4500] (316 bytes)
D received packet: from BB.BB.BB.BB[4500] to AA.AA.AA.AA[4500] (300 bytes)
D parsed QUICK_MODE response 3920627352 [ HASH SA No KE ID ID ]
D handling HA CHILD_SA vpn_BB.BB.BB.BB{382} 0.0.0.0/0 === 0.0.0.0/0 (segment in: 1, out: 1)
I CHILD_SA vpn_BB.BB.BB.BB{382} established with SPIs 7ac4f08b_i 9780f8e5_o and TS 0.0.0.0/0 === 0.0.0.0/0
D generating QUICK_MODE request 3920627352 [ HASH ]
D sending packet: from AA.AA.AA.AA[4500] to BB.BB.BB.BB[4500] (60 bytes)
By default, Cloud VPN negotiates a replacement SA before the existing one expires (also known as rekeying). Your on-premises VPN gateway (in this case, the AWS end) might not be rekeying this way. Instead, it might negotiate a new SA only after deleting the existing one, which causes interruptions. This is described in the GCP documentation under Common problems and solutions.
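As a rough illustration of why the rekeying order matters, here is a toy Python model of the two behaviours. It is not how either gateway is implemented. The 600 second margin is the gap between the two values in the question's log (maximum lifetime 36270s, rekeying scheduled at 35670s), and the 240 second negotiation time is an assumption borrowed from the roughly 4 minute outage reported in the question.

```python
def outage_seconds(negotiation_s, rekey_margin_s, replace_before_expiry):
    """Seconds without a usable SA around one rekey, in a simplified model."""
    if replace_before_expiry:
        # Negotiation starts rekey_margin_s before the old SA expires, so
        # traffic only stalls if negotiation overruns that margin.
        return max(0, negotiation_s - rekey_margin_s)
    # The old SA is deleted first, so the whole negotiation is downtime.
    return negotiation_s


# Margin taken from the log: 36270s lifetime - 35670s scheduled rekey.
margin = 36270 - 35670
# Assumed negotiation time, based on the ~4 minute outage in the question.
negotiation = 240

print(outage_seconds(negotiation, margin, replace_before_expiry=True))   # 0
print(outage_seconds(negotiation, margin, replace_before_expiry=False))  # 240
```

If the AWS side behaves like the second case, each rekey would show up as an outage even though nothing is wrong with DPD itself.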
I suspect DPD is just the symptom, not the actual cause of this issue. I also noticed in the log that the maximum IKE_SA lifetime is set to 36270 seconds, but as per the GCP documentation the lifetimes should be:
Phase 1 lifetime 36,600 seconds (10 hours, 10 minutes)
Phase 2 lifetime 10,800 seconds (3 hours)
These are the values the GCP documentation on supported IKE ciphers recommends when IKEv1 is used.
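To make the comparison concrete, here is a small Python sketch that pulls the lifetime values out of the two log lines quoted in the question and sets them against the documented Phase 1 figure. The function names are my own, and the regular expressions assume the log wording stays exactly as shown in the question.

```python
import re

# Documented Cloud VPN IKEv1 lifetimes quoted above, in seconds.
DOCUMENTED_PHASE1_S = 36_600
DOCUMENTED_PHASE2_S = 10_800


def fmt(seconds):
    """Format a number of seconds as hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


def parse_lifetimes(log_text):
    """Return (rekey_s, max_lifetime_s) from the log; None where a line is absent."""
    rekey = re.search(r"scheduling rekeying in (\d+)s", log_text)
    max_life = re.search(r"maximum IKE_SA lifetime (\d+)s", log_text)
    return (
        int(rekey.group(1)) if rekey else None,
        int(max_life.group(1)) if max_life else None,
    )


# The two relevant lines from the log in the question.
log = """D scheduling rekeying in 35670s
D maximum IKE_SA lifetime 36270s"""

rekey_s, max_life_s = parse_lifetimes(log)
print(fmt(max_life_s))                   # 10h 4m 30s
print(fmt(DOCUMENTED_PHASE1_S))          # 10h 10m 0s
print(DOCUMENTED_PHASE1_S - max_life_s)  # 330
```

So the logged maximum lifetime is 330 seconds short of the documented Phase 1 value.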
In light of this, I believe there is a configuration issue. Google Cloud Platform has specific guidelines for using an AWS VPN gateway with Cloud VPN, which should help in this case.
I would recommend revisiting the VPN configuration and making sure it follows the Google Cloud VPN Interop Guide.
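While you review the configuration, it may also help to record exactly when connectivity across the tunnel drops, so the outage times can be compared with the SA lifetimes. Below is a minimal Python sketch that probes a TCP port rather than using ICMP. The host and port in the usage comment are hypothetical placeholders for the internal address of the RDS or Redis endpoint, and the function names are my own.

```python
import socket
import time


def tunnel_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def watch(host, port, interval_s=10.0, checks=6):
    """Probe repeatedly and return (unix_time, reachable) for each state change."""
    changes = []
    last = None
    for i in range(checks):
        state = tunnel_reachable(host, port)
        if state != last:
            # Record only transitions, so the list stays short and readable.
            changes.append((time.time(), state))
            last = state
        if i < checks - 1:
            time.sleep(interval_s)
    return changes


# Hypothetical usage: replace with the internal IP and port of your database.
# for when, up in watch("10.0.0.5", 5432, interval_s=10.0, checks=360):
#     print(time.ctime(when), "up" if up else "down")
```

If the recorded drops recur at a regular interval close to one of the lifetimes above, that points to rekeying rather than idle traffic.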