NFSv4 "Too many levels of symbolic links" error

Both machines are running Ubuntu 12.04

Remote NFSv4 Client

Running ls /mnt/storage/aaaaaaa_aaa/bbbb/cccc_ccccc gives this error:
ls: reading directory .: Too many levels of symbolic links

How can I fix this?

When the error occurs, ls still starts listing the files, but PHP breaks.

On the NFSv4 Server

In /etc/fstab:

/mnt/storage    /srv/storage    none    bind    0 0

In /etc/exports

/srv         192.168.1.0/24(rw,async,insecure,no_subtree_check,crossmnt,fsid=0,no_root_squash)
/srv/storage   192.168.1.0/24(rw,async,nohide,insecure,no_subtree_check,no_root_squash)

ERROR

root@ds:/mnt/storage/foreign_dbs/imdb/imdb_htmls# ls -l | head
ls: reading directory .: Too many levels of symbolic links
total 10302840
-rw-r--r-- 1 root root  10484 Jul  5 13:56 0019038.gz
-rw-r--r-- 1 root root  16264 Mar 30 00:31 0259701.gz
-rw-r--r-- 1 root root  13784 Mar 30 14:20 1000000.gz
-rw-r--r-- 1 root root  12741 Mar 30 13:04 1000003.gz
-rw-r--r-- 1 root root  12794 Mar 30 12:40 1000004.gz
-rw-r--r-- 1 root root  13123 Mar 30 12:07 1000005.gz
-rw-r--r-- 1 root root  13183 Mar 30 12:04 1000006.gz
-rw-r--r-- 1 root root  13443 Jul  4 01:16 1000007.gz
-rw-r--r-- 1 root root  12968 Mar 30 11:05 1000008.gz

I came across it in PHP: scandir would return 1612577.gz and 1612579.gz but skip 1612578.gz, even though the file types and properties of all three are identical.

This only happens on the NFS client; it works 100% on the server.


About the problem

You can have a problem where two or more files have the same readdir cookie.

This problem is more common when using an NFS filesystem (v3 or v4) over an ext4 backend with a lot of files in the same directory (more than 50000). It can also occur when using GlusterFS instead of NFS.

PS: This problem can also occur with only a few files inside a single directory, but that is very improbable.

In this case, you will see Too many levels of symbolic links errors even if you have no symlinks inside your directory. You can confirm this by verifying that the following command returns no output:

find /mnt/storage/aaaaaaa_aaa/bbbb/cccc_ccccc -type l

To check whether you're hitting this specific problem, run the following command:

$ ls /mnt/storage/aaaaaaa_aaa/bbbb/cccc_ccccc >/dev/null
ls: reading directory .: Too many levels of symbolic links

Afterwards, check your syslog (/var/log/syslog) for entries like:

[400000.200000] NFS: directory /mnt/storage/aaaaaaa_aaa/bbbb/cccc_ccccc
contains a readdir loop. Please contact your server vendor.
The file: DDDDDDDDDD has duplicate cookie COOKIE_NUMBER.
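
The syslog check above can be sketched as a small shell helper. The function name find_readdir_loops is only an illustration, and the search patterns are taken from the sample message shown above; if your kernel words the message differently, adjust them.

```shell
# Print log lines that mention an NFS readdir loop or a duplicate cookie.
# $1: log file to scan (defaults to /var/log/syslog, as in the text above).
# find_readdir_loops is a hypothetical helper name, not a standard tool.
find_readdir_loops() {
    grep -E 'readdir loop|duplicate cookie' "${1:-/var/log/syslog}"
}

# Example usage on the NFS client:
#   find_readdir_loops
#   find_readdir_loops /var/log/kern.log
```

If this prints nothing after you reproduce the ls error, the cause is probably something other than a duplicate readdir cookie.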

The problem is related to the readdir API, which uses the readdir cookie to quickly locate an entry inside a directory. The NFS server uses this API when communicating with ext4 backends.

A complete and excellent explanation of the duplicate cookie problem (actually a hash collision problem) can be found at Widening ext4's readdir() cookie.

A related bug report can be found at NFS client reports a 'readdir loop' with a corrupt name.

If you can reboot your system, the good news is that, according to David Hedberg, this problem is already fixed in newer Ubuntu kernel versions (>= 3.2.0-60-generic). You may also need to update your NFS server: the fix only works if both the NFS server and the kernel are updated.
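
To see whether a machine is already at or above that kernel version, something like the following sketch may help. It assumes GNU sort with the -V (version sort) option; version_ge is a hypothetical helper name. The comparison is approximate for kernel flavours with other suffixes, so treat the result as a hint and run it on both the client and the server.

```shell
#!/bin/sh
# Minimum kernel version, taken from the text above.
MIN_KERNEL="3.2.0-60-generic"

# Succeeds if version $1 is >= version $2 under GNU `sort -V` ordering.
# version_ge is a hypothetical helper, not a standard command.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_ge "$(uname -r)" "$MIN_KERNEL"; then
    echo "kernel $(uname -r) is at or above $MIN_KERNEL"
else
    echo "kernel $(uname -r) is older than $MIN_KERNEL; consider upgrading"
fi
```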

PS: If you really love operating systems, you can check the kernel/NFS patches at http://comments.gmane.org - 32/64 bit llseek hashes.

Solution

Update your kernel and NFS kernel server and reboot the system:

apt-get -y dist-upgrade
reboot

If you can't reboot the system, you can also identify the file with the duplicate readdir cookie (check your syslog) and move it to another directory (or rename it to change its cookie/hash).
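
As a rough sketch of that workaround, the affected file name can be pulled out of the log message. This assumes the message has the form "The file: NAME has duplicate cookie ...", as in the sample above; duplicate_cookie_files is a hypothetical helper name, and the mv line is only an illustration of the rename.

```shell
# Print the names of files reported with a duplicate readdir cookie.
# $1: log file to scan (defaults to /var/log/syslog).
# duplicate_cookie_files is a hypothetical helper name.
duplicate_cookie_files() {
    sed -n 's/.*The file: \(.*\) has duplicate cookie.*/\1/p' \
        "${1:-/var/log/syslog}"
}

# Example: list the affected names, then rename one of them
# (directory path and new name are placeholders):
#   duplicate_cookie_files
#   mv /mnt/storage/aaaaaaa_aaa/bbbb/cccc_ccccc/NAME \
#      /mnt/storage/aaaaaaa_aaa/bbbb/cccc_ccccc/NAME.renamed
```

Renaming changes the name that gets hashed, so the collision should go away for that file, although another pair of names may still collide.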


Somewhere you have a symbolic link that points back to its parent. Use this to find it:

find /mnt/storage -type l -exec ls -l {} \;

Once you find it, you can work out how to correct it.