Inter-cloud IPSec VPN between AWS and Google Cloud times out and reconnects intermittently

The Dead Peer Detection (DPD) check on my AWS - Google Cloud IPSec VPN tunnel keeps timing out, forcing the tunnel to re-establish the connection at least once a day. Production services on Google Cloud are down for about 4 minutes each time while the connection re-establishes.

I've tried running a bash script on a Google Cloud VM every 5 minutes to ping AWS, but the timeouts keep happening. I guess ICMP isn't considered interesting traffic. Is there any other way I can troubleshoot this or force the connection to stay open?
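A variation on the ping script is to probe a TCP service port on the AWS side instead of sending ICMP. The following is only a minimal sketch: the host and port in the usage note are hypothetical placeholders, and nothing here confirms that TCP traffic prevents the DPD timeout. Its main value is logging when the far side becomes unreachable, so outages can be matched against the VPN log.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def monitor(host: str, port: int, interval: float = 30.0, max_probes=None):
    """Probe host:port every `interval` seconds and print a timestamped result.

    Runs forever when max_probes is None; otherwise stops after that many probes.
    Returns the list of (timestamp, reachable) tuples that were recorded.
    """
    results = []
    count = 0
    while max_probes is None or count < max_probes:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        reachable = probe(host, port)
        print(f"{stamp} {host}:{port} {'up' if reachable else 'DOWN'}")
        results.append((stamp, reachable))
        count += 1
        if max_probes is None or count < max_probes:
            time.sleep(interval)
    return results
```

For example, `monitor("10.0.0.5", 5432)` would probe a PostgreSQL port every 30 seconds; `10.0.0.5` is a made-up internal address and should be replaced with the real RDS or Redis endpoint.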

Rationale: I currently have an RDS PostgreSQL database and a Redis server running in AWS that some services on Google Cloud need to access. Rather than exposing them publicly, especially since Redis security relies largely on network isolation, I opted to open an IPSec tunnel so they could communicate via internal IPs. I followed the instructions at https://medium.com/google-cloud/vpn-between-two-clouds-e2e3578be773 and got the VPN working. The setup uses static routing, IPSec IKEv1, and the Generic VPN configuration for AWS.

Looking at the VPN logs from the Google Cloud end (I don't see a way to see this on AWS) - this is what it looks like most of the time.

D  sending DPD request 
D  generating INFORMATIONAL_V1 request 1644989539 [ HASH N(DPD) ] 
D  sending packet: from XX.XX.XX.XX[4500] to XX.XX.XX.XX[4500] (92 bytes) 
D  received packet: from XX.XX.XX.XX[4500] to XX.XX.XX.XX[4500] (92 bytes) 
D  parsed INFORMATIONAL_V1 request 4120768163 [ HASH N(DPD_ACK) ] 
D  sending DPD request 

When it goes down, it looks like the DPD check fails and it tries to re-establish the connection. I've pasted the full log below in case it is useful.

D  DPD check timed out, enforcing DPD action 
D  creating acquire job for policy with reqid {1} 
I  initiating Main Mode IKE_SA vpn_BB.BB.BB.BB[58] to BB.BB.BB.BB 
D  generating ID_PROT request 0 [ SA V V V V ] 
D  sending packet: from AA.AA.AA.AA[500] to BB.BB.BB.BB[500] (156 bytes) 
D  received packet: from BB.BB.BB.BB[500] to AA.AA.AA.AA[500] (124 bytes) 
D  parsed ID_PROT response 0 [ SA V V ] 
D  received DPD vendor ID 
D  received NAT-T (RFC 3947) vendor ID 
D  generating ID_PROT request 0 [ KE No NAT-D NAT-D ] 
D  sending packet: from AA.AA.AA.AA[500] to BB.BB.BB.BB[500] (244 bytes) 
D  received packet: from BB.BB.BB.BB[500] to AA.AA.AA.AA[500] (228 bytes) 
D  parsed ID_PROT response 0 [ KE No NAT-D NAT-D ] 
D  remote host is behind NAT 
D  generating ID_PROT request 0 [ ID HASH N(INITIAL_CONTACT) ] 
D  sending packet: from AA.AA.AA.AA[4500] to BB.BB.BB.BB[4500] (108 bytes) 
D  received packet: from BB.BB.BB.BB[4500] to AA.AA.AA.AA[4500] (76 bytes) 
D  parsed ID_PROT response 0 [ ID HASH ] 
I  IKE_SA vpn_BB.BB.BB.BB[58] established between AA.AA.AA.AA[AA.AA.AA.AA]...BB.BB.BB.BB[BB.BB.BB.BB] 
D  scheduling rekeying in 35670s 
D  maximum IKE_SA lifetime 36270s 
D  generating QUICK_MODE request 3920627352 [ HASH SA No KE ID ID ] 
D  sending packet: from AA.AA.AA.AA[4500] to BB.BB.BB.BB[4500] (316 bytes) 
D  received packet: from BB.BB.BB.BB[4500] to AA.AA.AA.AA[4500] (300 bytes) 
D  parsed QUICK_MODE response 3920627352 [ HASH SA No KE ID ID ] 
D  handling HA CHILD_SA vpn_BB.BB.BB.BB{382} 0.0.0.0/0  === 0.0.0.0/0  (segment in: 1, out: 1) 
I  CHILD_SA vpn_BB.BB.BB.BB{382} established with SPIs 7ac4f08b_i 9780f8e5_o and TS 0.0.0.0/0 === 0.0.0.0/0  
D  generating QUICK_MODE request 3920627352 [ HASH ] 
D  sending packet: from AA.AA.AA.AA[4500] to BB.BB.BB.BB[4500] (60 bytes) 
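To see how often this sequence occurs, the exported log can be scanned for the key messages shown above. This is a rough sketch that assumes the log has been exported as plain text, one entry per line, with the same message wording as in this excerpt; the function name `summarize` is just illustrative.

```python
import re
from collections import Counter

# Patterns taken from the messages in the log excerpt above.
PATTERNS = {
    "dpd_timeouts": re.compile(r"DPD check timed out"),
    "ike_sa_established": re.compile(r"\bIKE_SA\b.*\bestablished\b"),
    "child_sa_established": re.compile(r"\bCHILD_SA\b.*\bestablished\b"),
}


def summarize(lines):
    """Count DPD timeouts and SA (re-)establishments in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    # Return every key, including those with a zero count.
    return {name: counts[name] for name in PATTERNS}
```

Calling `summarize(open("vpn.log"))` on an exported file (the file name is a placeholder) gives a quick count of how many times per export the tunnel was torn down and rebuilt.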

According to the GCP documentation on common problems and solutions, Cloud VPN by default negotiates a replacement SA before the existing one expires (also known as rekeying). Your on-premises VPN gateway (in this case, the AWS end) might not be rekeying. Instead, it might negotiate a new SA only after deleting the existing one, causing interruptions.

I suspect DPD is just the symptom, not the actual cause of this issue. I also noticed from the log that the maximum IKE_SA lifetime is set to 36270 seconds, but according to the GCP documentation the lifetimes should be:

Phase 1 lifetime 36,600 seconds (10 hours, 10 minutes)

Phase 2 lifetime 10,800 seconds (3 hours)

These values apply when IKEv1 is used, per the GCP documentation on supported IKEv1 ciphers.
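To make the comparison concrete, the logged and documented lifetimes can be converted to hours and minutes. This small sketch uses only the numbers quoted in the log and in the documentation excerpt above; it does not explain why they differ.

```python
def fmt(seconds: int) -> str:
    """Format a duration in seconds as 'Hh Mm Ss'."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


# Values quoted from the GCP documentation above.
DOCUMENTED_PHASE1 = 36_600
DOCUMENTED_PHASE2 = 10_800
# Values taken from the Cloud VPN log in the question.
LOGGED_MAX_IKE_SA = 36_270
LOGGED_REKEY_AT = 35_670

for label, value in [
    ("documented Phase 1 lifetime", DOCUMENTED_PHASE1),
    ("documented Phase 2 lifetime", DOCUMENTED_PHASE2),
    ("logged maximum IKE_SA lifetime", LOGGED_MAX_IKE_SA),
    ("logged rekeying scheduled at", LOGGED_REKEY_AT),
]:
    print(f"{label:<32} {value:>6}s = {fmt(value)}")

# Gap between the documented and the logged Phase 1 lifetime.
print("documented minus logged:", DOCUMENTED_PHASE1 - LOGGED_MAX_IKE_SA, "s")
# Margin between the scheduled rekey and the logged maximum lifetime.
print("rekey margin:", LOGGED_MAX_IKE_SA - LOGGED_REKEY_AT, "s")
```

The logged maximum is 330 seconds shorter than the documented Phase 1 lifetime, and rekeying is scheduled 600 seconds before the logged maximum.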

I therefore believe there is a configuration issue. Google Cloud Platform has specific guidelines for using an AWS VPN gateway with Cloud VPN, which should help in this case.

I recommend revisiting the VPN configuration and making sure it follows the Google Cloud VPN Interop Guide.