nftables chain with priority -300 (raw) still sees fragments; why?

According to the nftables wiki (see also this answer), packet defragmentation happens at priority -400. However, when I create an nftables chain at priority -300:

flush ruleset;
table ip test {
    chain prerouting {
        type filter hook prerouting priority -300; policy accept;
        ip frag-off & 0x1fff != 0 log;
    }
}

I clearly see fragmented packets in the kernel logs:

[ 2526.162244] IN=ens7 OUT= MAC=0c:5c:00:2d:b4:03:0c:80:9a:6a:23:01:08:00 SRC=201.201.201.1 DST=200.200.200.2 LEN=1500 TOS=0x00 PREC=0x00 TTL=63 ID=33977 MF FRAG:185 PROTO=UDP 
[ 2526.162752] IN=ens7 OUT= MAC=0c:5c:00:2d:b4:03:0c:80:9a:6a:23:01:08:00 SRC=201.201.201.1 DST=200.200.200.2 LEN=961 TOS=0x00 PREC=0x00 TTL=63 ID=33977 FRAG:370 PROTO=UDP 

The above code is just a minimal reproducible example; in our actual code, this leads to problems such as only the initial UDP fragment undergoing (raw) NAT, etc.

The kernel module nf_conntrack is loaded, along with nf_defrag_ipv4. What am I doing wrong?

EDIT:

I find that this behaviour goes away as soon as I add a rule that depends on conntrack. The rule may be anything at all, e.g.

nft add rule ip test prerouting ct state new,invalid,established,related counter accept

It's as if pulling in conntrack tells Linux "I want some conntrack functionalities". So my follow-up question is, is there a way to enable conntrack without needing to add this extra (dummy) rule?


As you have noticed, the network stack does not defragment packets unless something specifically requires it. This keeps forwarding fast.

When Linux only needs to forward a packet, it makes the forwarding decision from the L3 (IP) information alone and does not need to look at L4 (TCP/UDP) information. Every fragment carries a complete IP header, so there is no need to defragment.

However, NAT and connection tracking need the L4 (TCP/UDP) information, which is present only in the first fragment, so the packets must be defragmented before those steps can run.
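To make that concrete, here is a small Python sketch that decodes the fragment field the same way your `ip frag-off & 0x1fff != 0` rule does, using the values from your two log lines. The helper name `describe_fragment` and the 20-byte IP header length are my own assumptions, and the first fragment's LEN=1500 is a guess, since that fragment never shows up in your log.

```python
IP_MF = 0x2000      # "more fragments" flag in the 16-bit flags/offset field
IP_OFFSET = 0x1FFF  # 13-bit fragment offset, counted in 8-byte units


def describe_fragment(frag_field: int, total_len: int, ihl_bytes: int = 20) -> dict:
    """Decode the IPv4 flags/offset field (hypothetical helper, for illustration).

    ihl_bytes=20 assumes an IP header without options.
    """
    offset_units = frag_field & IP_OFFSET
    return {
        "offset_bytes": offset_units * 8,
        "more_fragments": bool(frag_field & IP_MF),
        "payload_len": total_len - ihl_bytes,
        # The UDP/TCP header sits at offset 0 of the original payload,
        # so only the fragment with offset 0 carries it.
        "has_l4_header": offset_units == 0,
        # Same test as the nft rule: ip frag-off & 0x1fff != 0
        "matches_nft_rule": offset_units != 0,
    }


# First fragment: not in the log, LEN=1500 is assumed.
first = describe_fragment(IP_MF | 0, 1500)
# "LEN=1500 ... MF FRAG:185" from the log.
second = describe_fragment(IP_MF | 185, 1500)
# "LEN=961 ... FRAG:370" from the log.
last = describe_fragment(370, 961)

for name, frag in (("first", first), ("second", second), ("last", last)):
    print(name, frag)
```

The two logged fragments start at byte offsets 1480 and 2960 of the original payload, so neither contains the UDP header. The first fragment has offset 0, which is why the rule does not log it and why it is the only one a port-based rule can match without reassembly.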

There are several options under /proc/sys/net/ipv4 that change how the Linux network stack operates, including ipfrag-related settings. However, I don't see a "force defragmentation" setting among them.

So enabling connection tracking may be the only way to force IP defragmentation.
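If you want to check which ipfrag knobs your own kernel exposes, a sketch like the following lists them. The function name `read_ipfrag_settings` is mine, and the exact set of entries depends on the kernel version. As far as I know these entries only tune reassembly limits and timeouts (for example `ipfrag_time`), so do not expect a switch that turns defragmentation on.

```python
from pathlib import Path


def read_ipfrag_settings(base: str = "/proc/sys/net/ipv4") -> dict:
    """Return {name: value} for every ipfrag_* entry under `base`.

    Hypothetical helper; `base` is a parameter so the function can be
    pointed at another directory for testing.
    """
    root = Path(base)
    if not root.is_dir():
        # Not Linux, or /proc is not mounted: nothing to report.
        return {}
    settings = {}
    for entry in sorted(root.glob("ipfrag_*")):
        try:
            settings[entry.name] = entry.read_text().strip()
        except OSError:
            # Entry exists but cannot be read (e.g. permissions): skip it.
            continue
    return settings


if __name__ == "__main__":
    for name, value in read_ipfrag_settings().items():
        print(f"{name} = {value}")
```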