Timeouts on Cloud SQL and other external services when using NAT + IP Masquerade on GKE

I need one of my Pods to use a static egress IP because a remote service (outside my cluster) only accepts traffic from whitelisted, trusted IPs.

I followed the documentation provided by Google:

https://cloud.google.com/nat/docs/overview?hl=es-419

https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent

But when I configure egress traffic through Google Cloud NAT in my GKE cluster and add masquerading with the ip-masq-agent, I start getting timeouts and other problems when accessing remote services outside the cluster.

My cluster runs version 1.19.10-gke.1600.

I have tried these config files with the following results:

resyncInterval: 60s

Result:

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             10.0.0.0/8           /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             172.16.0.0/12        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.168.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) */

The external services still see the wrong source IP.


resyncInterval: 60s
masqLinkLocal: true

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             169.254.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             10.0.0.0/8           /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             172.16.0.0/12        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.168.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in the chain) */

Same effect: the external services still see the wrong IP.


nonMasqueradeCIDRs:
  - 0.0.0.0/0
resyncInterval: 60s
masqLinkLocal: true

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere             /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in the chain) */

This looks better, because the external services now receive the correct IP, but I get connection problems and timeouts.


This is my NAT configuration:

NAT mapping
- High availability: Yes
- Source subnets & IP ranges: All subnets' primary and secondary IP ranges
- NAT IP addresses: static-egress-ip XXX.XXX.XXX.XXX

I'm out of ideas. Can someone give me some advice?


Update: after the answer below, I changed my config file to list the IP ranges from the Google Cloud documentation. The file now looks like this:

nonMasqueradeCIDRs:
  - 10.0.0.0/8
  - 172.16.0.0/12
  - 192.168.0.0/16
  - 100.64.0.0/10
  - 192.0.0.0/24
  - 192.0.2.0/24
  - 192.88.99.0/24
  - 198.18.0.0/15
  - 198.51.100.0/24
  - 203.0.113.0/24
  - 240.0.0.0/4
resyncInterval: 60s
masqLinkLocal: true

The resulting iptables chain is:

Chain IP-MASQ (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             10.0.0.0/8           /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             172.16.0.0/12        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.168.0.0/16       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             100.64.0.0/10        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.0.0.0/24         /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.0.2.0/24         /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             192.88.99.0/24       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             198.18.0.0/15        /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             198.51.100.0/24      /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             203.0.113.0/24       /* ip-masq-agent: local traffic is not subject to MASQUERADE */
RETURN     all  --  anywhere             240.0.0.0/4          /* ip-masq-agent: local traffic is not subject to MASQUERADE */
MASQUERADE  all  --  anywhere             anywhere             /* ip-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) */

But if I run curl checkip.amazonaws.com to see which IP the node is using, I get an IP different from the one defined in my Cloud NAT configuration, and the external services reject requests from my cluster as untrusted.
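
The curl check above can also be scripted so it is easy to repeat after each config change. This is only a minimal sketch: the function names are made up for illustration, 203.0.113.10 is a placeholder standing in for the real static-egress-ip, and it assumes checkip.amazonaws.com answers with the caller's IP as plain text. It is probably most informative when run from inside the Pod whose egress address matters, not only from the node.

```python
import ipaddress
import urllib.request


def fetch_egress_ip(url="https://checkip.amazonaws.com", timeout=5.0):
    """Return the source IP that the remote echo service reports for this caller."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def is_expected_egress(observed, expected):
    """Compare two IP strings after parsing, ignoring surrounding whitespace."""
    return ipaddress.ip_address(observed.strip()) == ipaddress.ip_address(expected.strip())


def report(observed, expected):
    """Build a one-line summary of the comparison."""
    status = "OK" if is_expected_egress(observed, expected) else "MISMATCH"
    return f"{status}: observed {observed.strip()}, expected {expected.strip()}"
```

For example, report(fetch_egress_ip(), "203.0.113.10") returns a line starting with OK or MISMATCH, with the placeholder replaced by the address reserved for Cloud NAT.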


It seems you have set nonMasqueradeCIDRs to 0.0.0.0/0, which prevents masquerading for traffic to every destination. To fix this, update the nonMasqueradeCIDRs key in the config file with the ranges listed in the Default non-masquerade destination paragraph [1], as given below.

nonMasqueradeCIDRs:
  - 172.16.0.0/12
  - 192.168.0.0/16
  - 100.64.0.0/10
  - 192.0.0.0/24
  - 192.0.2.0/24
  - 192.88.99.0/24
  - 198.18.0.0/15
  - 198.51.100.0/24
  - 203.0.113.0/24
  - 240.0.0.0/4
  - 10.0.0.0/8

Also, please note that the IPs shown in the screenshot are not wrong IPs. The ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 are the private ranges reserved by RFC 1918, and 169.254.0.0/16 is reserved for link-local addresses. Traffic to these ranges is not masqueraded, which is why they are listed with the description 'ip-masq-agent: local traffic is not subject to MASQUERADE' [2].
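
To make the effect of each rule list easier to reason about, here is a rough model of how the IP-MASQ chain treats a destination: the first RETURN rule that matches leaves the packet's source address alone, and anything that falls through hits the final MASQUERADE rule. This is only an illustrative sketch, not the agent's actual code. The function name and the sample destinations (8.8.8.8 and 10.1.2.3) are assumptions for the example, and link-local handling is deliberately left out.

```python
import ipaddress


def ip_masq_decision(dest_ip, return_cidrs):
    """Model the IP-MASQ chain: first matching RETURN rule wins, else MASQUERADE."""
    dest = ipaddress.ip_address(dest_ip)
    for cidr in return_cidrs:
        if dest in ipaddress.ip_network(cidr):
            return "RETURN"  # source address is left unchanged
    return "MASQUERADE"  # source address is rewritten on the way out


# RETURN rules as they appear in the chains quoted in the question.
RFC1918_RULES = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
CATCH_ALL_RULES = ["0.0.0.0/0"]  # iptables prints this destination as "anywhere"

# Hypothetical destinations chosen only for illustration.
PUBLIC_DEST = "8.8.8.8"
PRIVATE_DEST = "10.1.2.3"

for name, rules in [("rfc1918", RFC1918_RULES), ("catch-all", CATCH_ALL_RULES)]:
    for dest in (PUBLIC_DEST, PRIVATE_DEST):
        print(f"{name:10s} {dest:10s} -> {ip_masq_decision(dest, rules)}")
```

With the RFC 1918 rules, a public destination falls through to MASQUERADE while a private one returns early. With 0.0.0.0/0 every destination matches the first RETURN rule, so nothing is masqueraded.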

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent#default-non-masq-dests

[2] https://kubernetes.io/docs/tasks/administer-cluster/ip-masq-agent/#ip-masquerade-agent-user-guide

Regards, Anbu.