NFS Stale File Handle After NFS Server Reboots: Why Does This Happen and How Does Industry Handle This?

Solution 1:

You're using NFS version 3, which needs several helper services in addition to the main NFS service on port 2049. One of these is rpc.statd, which plays a key role in detecting reboots and in recovering or clearing NFS locks after a reboot.

These helper services may listen on random ports, and clients discover them by contacting the RPC port mapper (usually a process named rpcbind on modern Linux systems). In modern networks with firewalls, this behavior can make things difficult: even though you may find the services on deterministic-looking ports after a reboot, they may be allocated quite different port numbers if and when you restart the NFS services.

Fortunately, on many modern Unix-like systems, you can lock down the port numbers of the NFS lock manager (historically rpc.lockd, nowadays usually implemented in-kernel), rpc.statd and rpc.mountd. This is essential if you want to pass NFSv3 through firewalls with any sort of reliability.

For RHEL and related distributions, you can lock down the NFS helper port numbers by adding these lines to /etc/sysconfig/nfs (this applies to RHEL 7 and older; newer releases configure NFS through /etc/nfs.conf instead):

LOCKD_TCPPORT=4045
LOCKD_UDPPORT=4045
STATD_PORT=4046
MOUNTD_PORT=4047

For Debian and related distributions, you might add this line to /etc/modprobe.d/nfs.conf:

options lockd nlm_udpport=4045 nlm_tcpport=4045

... and this line in /etc/default/nfs-common:

STATDOPTS="-p 4046"

... and this line in /etc/default/nfs-kernel-server:

RPCMOUNTDOPTS="-p 4047" # you may want to add a --manage-gids option here

(You can use different port numbers if you wish, but 4045 is the default port for the NFSv3 lock manager in Solaris, and it is hard-coded for the same purpose in HP-UX 11.31.)
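
Once the ports are pinned, it can be worth confirming that the services actually registered on them. The following is a minimal Python sketch, not part of the original answer: it parses the output of rpcinfo -p and compares it against the port numbers used above. The function names and the EXPECTED_PORTS table are illustrative, and the sketch assumes the service names (nlockmgr, status, mountd, nfs) as rpcinfo typically prints them on Linux.

```python
import subprocess

# Ports pinned in the configuration above, keyed by the service names
# that `rpcinfo -p` prints (status = rpc.statd, nlockmgr = lock manager).
# Adjust this table if you chose different port numbers.
EXPECTED_PORTS = {"nlockmgr": 4045, "status": 4046, "mountd": 4047, "nfs": 2049}


def parse_rpcinfo(text):
    """Return {service name: set of ports} from `rpcinfo -p` output."""
    ports = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five columns: program vers proto port service.
        # The header row and rows without a service name are skipped.
        if len(fields) != 5 or not fields[3].isdigit():
            continue
        ports.setdefault(fields[4], set()).add(int(fields[3]))
    return ports


def find_mismatches(ports, expected=EXPECTED_PORTS):
    """Return {service: ports seen} for services that are missing
    or registered on anything other than the expected port."""
    problems = {}
    for service, port in expected.items():
        seen = ports.get(service, set())
        if seen != {port}:
            problems[service] = sorted(seen)
    return problems


def rpcinfo_ports(host):
    """Run `rpcinfo -p HOST` and parse its output (requires rpcinfo)."""
    result = subprocess.run(
        ["rpcinfo", "-p", host], capture_output=True, text=True, check=True
    )
    return parse_rpcinfo(result.stdout)
```

Calling find_mismatches(rpcinfo_ports("server.example")) from a client should return an empty dict when every helper is on its pinned port and reachable through the firewall; an empty list for a service means it was not registered at all.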

But there is another pitfall in the NFSv3 protocol. Although you can successfully mount an NFS share using just IP addresses, the NFSv3 lock protocol internally uses hostnames. Both the client and the server must know each other by the correct names; otherwise NFS file locking and lock recovery after a reboot won't work. The "correct name" for each system is the name reported by uname -n.

So, if uname -n returns server.example on the server and client.example on the client, then you should make sure those exact names resolve to the IP addresses the hosts need to use for NFS. In other words, the server must be able to contact the client's rpc.statd using the name client.example, and vice versa.

If the names don't resolve correctly, everything may seem to work well at first... but when either end reboots, you may get those "Stale file handle" errors.
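
The name check described above can be scripted. This is a rough Python sketch under stated assumptions: the function names are made up for illustration, the 192.0.2.x addresses in the usage note are placeholder documentation addresses, and only IPv4 resolution is checked. Run it on both the server and the client, since each side must be able to resolve the other's uname -n name.

```python
import os
import socket


def resolve(name, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses `name` resolves to (empty on failure)."""
    try:
        # gethostbyname_ex returns (hostname, aliases, ip_addresses).
        return set(resolver(name)[2])
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return set()


def check_names(names_to_ips, resolver=socket.gethostbyname_ex):
    """Given {uname -n name: IP address used for NFS}, return the names that
    do not resolve to the expected address, with what they resolved to."""
    problems = {}
    for name, expected_ip in names_to_ips.items():
        addresses = resolve(name, resolver)
        if expected_ip not in addresses:
            problems[name] = sorted(addresses)
    return problems


def local_nodename():
    """The name this host reports, equivalent to `uname -n`."""
    return os.uname().nodename
```

For example, check_names({"server.example": "192.0.2.10", "client.example": "192.0.2.20"}) returns an empty dict when both names resolve to the addresses used for NFS, and local_nodename() shows which name the other side needs to be able to resolve.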

Solution 2:

In addition to the excellent answer from @telcoM, I would like to suggest two other possible solutions:

1. Mount the NFS share with the noac option, which disables attribute caching (beware that this will cause a performance loss when issuing ls on a large directory or stat on many files).
2. Use NFS v4.1 (v4.0 had some bugs leading to "stale file handle" errors, so be sure to select the v4.1 protocol).
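
To see which of these settings a client actually ended up with, you can inspect /proc/mounts, where Linux lists the effective options of each mount (for NFS these include vers= and, if set, noac). The following Python sketch is only an illustration: nfs_mounts is a hypothetical helper name, and it assumes the usual Linux /proc/mounts layout of device, mount point, filesystem type and options.

```python
def nfs_mounts(mounts_text):
    """Return a list of (mount point, NFS version, noac enabled) tuples
    for every NFS entry in /proc/mounts-style text."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Columns: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        # The negotiated protocol version appears as e.g. vers=3 or vers=4.1.
        version = next(
            (opt.split("=", 1)[1] for opt in options if opt.startswith("vers=")),
            None,
        )
        result.append((fields[1], version, "noac" in options))
    return result


def current_nfs_mounts(path="/proc/mounts"):
    """Read the live mount table (Linux only) and report its NFS entries."""
    with open(path) as handle:
        return nfs_mounts(handle.read())
```

On a client, current_nfs_mounts() then shows at a glance whether a share requested with vers=4.1 was really mounted with that version or fell back to something else.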
