Why is the network stack ignoring ICMP replies from a non-default interface?

I have the following situation:

  • eth0 - default gateway ( ip: 172.28.183.100, gw: 172.28.183.1 )
  • eth2 - secondary network connection ( ip: 172.28.171.2, gw: 172.28.171.1 )

The routing table looks like this:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.28.183.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
172.28.171.0    0.0.0.0         255.255.255.0   U     0      0        0 eth2
172.28.173.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
78.46.78.0      172.28.171.1    255.255.255.0   UG    0      0        0 eth2
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 eth0
0.0.0.0         172.28.183.1    0.0.0.0         UG    100    0        0 eth0

As you can see, there is a special route for 78.46.78.0/24: this traffic should go through the secondary network on eth2.

This works: I can make any kind of TCP connection to machines in 78.46.78.0/24.

But when I try to run mtr against them, I get a weird result:

root@blob:~# mtr --report --report-cycles=5 78.46.78.198
HOST: blob                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. 172.28.171.1                  0.0%     5    0.6   0.6   0.5   0.6   0.0
  2. ???                          100.0     5    0.0   0.0   0.0   0.0   0.0

In the tcpdump output I can see the ICMP time-exceeded replies coming back:

10:16:28.158888 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 59520, length 44
10:16:28.159363 IP 172.28.171.1 > 172.28.171.2: ICMP time exceeded in-transit, length 72
10:16:28.259153 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 59776, length 44
10:16:28.359546 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 60032, length 44
10:16:28.408129 IP 10.9.208.1 > 172.28.171.2: ICMP time exceeded in-transit, length 36
10:16:28.428193 IP 10.9.208.2 > 172.28.171.2: ICMP time exceeded in-transit, length 36
10:16:28.459953 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 60288, length 44
10:16:28.560260 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 60544, length 44
10:16:28.618138 IP 10.9.213.6 > 172.28.171.2: ICMP time exceeded in-transit, length 36
10:16:28.660678 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 60800, length 44
10:16:28.708130 IP 10.9.212.253 > 172.28.171.2: ICMP time exceeded in-transit, length 36
10:16:28.730193 IP 213.158.195.13 > 172.28.171.2: ICMP time exceeded in-transit, length 36
10:16:28.761086 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 61056, length 44
10:16:28.861380 IP 172.28.171.2 > 78.46.78.198: ICMP echo request, id 2092, seq 61312, length 44
10:16:28.938167 IP 213.248.89.153 > 172.28.171.2: ICMP time exceeded in-transit, length 36

But running strace on mtr shows that these ICMP replies are never delivered to mtr!

I think the reason might be that the ICMP responses arrive on the "wrong" interface: a reply comes from, for example, 10.9.212.253 (some intermediate router), and traffic to that address would be routed via eth0, yet the reply arrives on eth2.
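
One way to test this hypothesis is to ask the kernel which interface it would use to reach the sender of a reply, and compare that with the interface the reply arrived on. The following is only a sketch: the helper names route_dev and reverse_path_ok are made up for illustration, it assumes the iproute2 `ip` command is installed, and a plain `ip route get` lookup only approximates the check the kernel performs internally.

```shell
# Extract the outgoing interface (the word after "dev") from
# `ip route get` output read on stdin. Hypothetical helper name.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Succeed if the route back to address $1 leaves through interface $2,
# which is roughly what a strict reverse-path check demands of a packet
# arriving on $2. Requires the iproute2 `ip` command.
reverse_path_ok() {
    [ "$(ip route get "$1" | route_dev)" = "$2" ]
}
```

With the routing table above, 10.9.212.253 matches only the default route via eth0, so `reverse_path_ok 10.9.212.253 eth2` would be expected to fail, which is consistent with the replies being discarded.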

Is that a sensible explanation? What can I do to make mtr work for my "special" network too?

The iptables rules are set up using:

iptables -P INPUT   DROP
iptables -P FORWARD DROP
iptables -P OUTPUT  ACCEPT

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo                                  -j ACCEPT
iptables -A INPUT -i eth1                                -j ACCEPT
iptables -A INPUT -p icmp                                -j ACCEPT

iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -i eth1                              -j ACCEPT

iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE

iptables -A INPUT   -j LOG --log-prefix 'IPTABLES: '
iptables -A FORWARD -j LOG --log-prefix 'IPTABLES: '

But I don't see any ICMP-related packets in kern.log.


Thanks to Rafał Ramocki, the solution is simple: you have to turn off rp_filter on the eth2 interface:

echo 0 > /proc/sys/net/ipv4/conf/eth2/rp_filter
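
One caveat, as far as I understand the current kernel documentation: the value applied to an interface is the larger of conf/all/rp_filter and conf/<interface>/rp_filter, so writing 0 for eth2 alone may not be enough if "all" is still set to 1. A small sketch for checking the effective value (the function name and its optional second argument are made up for illustration):

```shell
# Print the rp_filter value that applies to an interface: the larger of
# conf/all/rp_filter and conf/<iface>/rp_filter (per current kernel docs).
# The optional second argument (config directory) is a hypothetical
# convenience for testing; it defaults to the real /proc path.
effective_rp_filter() {
    iface=$1
    conf_dir=${2:-/proc/sys/net/ipv4/conf}
    all=$(cat "$conf_dir/all/rp_filter") || return 1
    own=$(cat "$conf_dir/$iface/rp_filter") || return 1
    if [ "$all" -gt "$own" ]; then
        echo "$all"
    else
        echo "$own"
    fi
}
```

Running `effective_rp_filter eth2` after the echo above should print 0. If it does not, conf/all/rp_filter probably needs to be lowered as well. To keep the setting across reboots, the usual approach is a line such as `net.ipv4.conf.eth2.rp_filter = 0` in /etc/sysctl.conf.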

From the kernel docs:

rp_filter
---------

Integer value determines if a source validation should be made. 1 means yes, 0
means no.  Disabled by default, but local/broadcast address spoofing is always
on.

If you  set this to 1 on a router that is the only connection for a network to
the net,  it  will  prevent  spoofing  attacks  against your internal networks
(external addresses  can  still  be  spoofed), without the need for additional
firewall rules.

While it is nice for preventing (at least some) spoofing attacks, it definitely breaks some functionality if you have more than one Internet connection.
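
As for why nothing showed up in kern.log: as far as I understand the packet path, source validation happens during the input routing decision, before the filter INPUT chain, so the iptables LOG rule never sees the dropped packets. The kernel can report them itself through the log_martians setting. A minimal sketch, with made-up function names, assuming such drops are logged with the text "martian source":

```shell
# Turn on kernel logging of packets that fail source validation on one
# interface. Must be run as root. The optional second argument (config
# directory) is a hypothetical convenience for testing; it defaults to
# the real /proc path.
enable_martian_log() {
    iface=$1
    conf_dir=${2:-/proc/sys/net/ipv4/conf}
    echo 1 > "$conf_dir/$iface/log_martians"
}

# Count kernel log lines on stdin that mention a martian source.
count_martians() {
    grep -c 'martian source'
}
```

For example, run `enable_martian_log eth2` while rp_filter is still on, repeat the mtr test, and then check with `dmesg | count_martians` or `count_martians < /var/log/kern.log`. A growing count would suggest that the replies were being dropped by the reverse-path filter.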