NAT packet goes out via the wrong gateway

I have two interfaces, say eth0 and a VLAN interface eth0.4000. Each has its own default gateway. Everything works as expected when a process listens on the interface directly.

It does not work for Kubernetes hostPort bindings, though.

vlan.gw-mac > eth0-mac,    ethertype 802.1Q (0x8100), length 78: vlan 4000, p 0, ethertype IPv4 (0x0800), clientIP.38712 > vlanIP.80: Flags [S]
eth0-mac    > eth0.gw-mac, ethertype IPv4   (0x0800), length 74:                                          vlanIP.80      > clientIP.38712: Flags [S.]

The SYN arrives from the VLAN gateway and is forwarded to the container, but the SYN-ACK reply leaves the stack via the eth0 gateway instead of the VLAN gateway, even though tcpdump shows that its source IP is vlanIP.

The routing tables look good:

# ip route get to <clientIP> from <vlanIP> dev eth0.4000
<clientIP> from <vlanIP> via <vlan.gw> dev eth0.4000 table 1 uid 0

The hostPort mapping is created by the CNI portmap plugin, which uses DNAT and SNAT (details linked). So the gateway lookup happens too early. When I manually add a rule that makes traffic from the container IP look up table 1, replies go out via the VLAN interface, but that breaks eth0.

So the question is: what has to be done so that routing happens after NAT has replaced the container IP with the interface IP?


You are right that the implicit SNAT that reverses the DNAT happens too late: by that point the routing decision has already been made, so the correct source IP is used on the wrong interface.

To avoid this, you'll need to go deeper into policy-based routing. The technique described in https://superuser.com/questions/638044/source-based-policy-routing-nat-dnat-snat-aka-multi-wans-on-centos-5 can be used.

For this, you need the following in the PREROUTING chain of the mangle table:

-A PREROUTING -i vlanIface -m state --state NEW,RELATED,ESTABLISHED -d <vlanIP> -j CONNMARK --set-mark 0x10/0x10
-A PREROUTING -m connmark --mark 0x10/0x10 -j CONNMARK --restore-mark --ctmask 0x10

This way, all packets belonging to connections that were initiated over vlanIface will have bit 0x10 set in their fwmark, which can then be used for policy-based routing (PBR). Assuming your pod network is 10.0.0.0/8 and the table for your secondary gateway is 1:

ip rule add fwmark 0x10/0x10 from 10.0.0.0/8 table 1
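
For convenience, the two mangle rules and the routing rule could be collected in one script. The following is only a dry-run sketch: the address 192.0.2.10 is a made-up placeholder for <vlanIP>, and the `run` and `setup_pbr` helpers and the `APPLY` switch are hypothetical names introduced for illustration. By default it only prints the commands; it executes them only when APPLY=1 is set (which requires root).

```shell
#!/bin/sh
# Assumed values; adjust to your setup.
VLAN_IF="eth0.4000"     # VLAN interface from the question
VLAN_IP="192.0.2.10"    # placeholder for <vlanIP>
POD_NET="10.0.0.0/8"    # pod network assumed in the answer
TABLE="1"               # routing table of the secondary gateway
MARK="0x10/0x10"        # mark value/mask used for these connections

# Print the command; execute it only when APPLY=1.
run() {
    echo "$*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

setup_pbr() {
    # Mark connections that arrive on the VLAN interface for the VLAN IP.
    run iptables -t mangle -A PREROUTING -i "$VLAN_IF" -d "$VLAN_IP" \
        -m state --state NEW,RELATED,ESTABLISHED \
        -j CONNMARK --set-mark "$MARK"
    # Copy the connection mark to the packet mark (fwmark).
    run iptables -t mangle -A PREROUTING -m connmark --mark "$MARK" \
        -j CONNMARK --restore-mark --ctmask 0x10
    # Route marked packets coming from the pod network via table 1.
    run ip rule add fwmark "$MARK" from "$POD_NET" table "$TABLE"
}

setup_pbr
```

Running it without APPLY=1 lets you review the three commands before anything on the host is changed.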

You might be able to leave out the from 10.0.0.0/8, but it is a useful safety net against incorrectly set fwmarks (e.g. because the same mark bit is also used by something else on the host).
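
To illustrate how such a collision can happen: a selector of the form value/mask matches whenever (mark & mask) equals value, so 0x10/0x10 matches every mark that has bit 0x10 set, whatever the other bits are. The small sketch below only does the arithmetic. `mark_matches` is a hypothetical helper, and the sample marks are made up.

```shell
#!/bin/sh
# mark_matches VALUE MASK MARK
# Succeeds when (MARK & MASK) == VALUE, the comparison used for value/mask selectors.
mark_matches() {
    [ $(( $3 & $2 )) -eq $(( $1 )) ]
}

# Made-up sample marks: some share bit 0x10 with other bits, one does not have it.
for mark in 0x10 0x11 0x30 0x01; do
    if mark_matches 0x10 0x10 "$mark"; then
        echo "$mark matches 0x10/0x10"
    else
        echo "$mark does not match 0x10/0x10"
    fi
done
```

A mark such as 0x30 set by some other component would therefore also select table 1, which is what the additional from selector guards against.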