Ridiculous Number of Files in a Directory

My employer acquired a company with one particular piece of software that stores a lot of PDF and PNG files in one directory. When I first replicated it from AWS, there were about 11.5 million files. Now the number is approaching 13 million and performance is -- to be charitable -- pathetic.

The directory has to be shared between four servers, so just attaching a LUN to each server is out. When I did the original copy I tried an ext4 filesystem, but I started having serious problems at about 10 million files. I considered trying XFS, but the short lead time demanded I just get them copied. I finally put them on a Dell Isilon, which has a UFS file system and runs BSD. The directory is exported using NFS.

If the decision is to build a new NFS server just for this, which file systems can handle such a ridiculous number of files and still give decent performance when retrieving them? I know the best solution would be to break things up so there are not so many files in one directory, but in the contest between fast, cheap, and good, the good always gets last place.


Solution 1:

A directory holding a very large number of files will eventually become unusably slow because its metadata grows enormous. Fast storage and a careful file system choice will only get you so far.

Restructure the tree into many directories. Compute a uniform hash of each file's content and store the file in a subdirectory named after the first couple of hex digits of that hash. For a SHA-1 of da39a3ee5e6b4b0d3255bfef95601890afd80709, store the file as da/39a3ee5e6b4b0d3255bfef95601890afd80709.png. A two-digit prefix gives 256 buckets; two levels of prefixes gives 65,536, which keeps each directory at a manageable size even with tens of millions of files.
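A minimal sketch of that sharding scheme in Python, assuming a copy-into-place migration; `shard_path` and `store` are hypothetical helpers, not part of any existing tool:

    import hashlib
    import shutil
    from pathlib import Path

    def shard_path(src: Path, dest_root: Path, levels: int = 1, width: int = 2) -> Path:
        """Return a sharded destination path based on the SHA-1 of the file's content."""
        sha = hashlib.sha1()
        with src.open("rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                sha.update(chunk)
        digest = sha.hexdigest()
        # The first `levels` groups of `width` hex digits become subdirectories,
        # e.g. da39a3ee... -> da/39a3ee...
        parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
        return dest_root.joinpath(*parts, digest[levels * width:] + src.suffix)

    def store(src: Path, dest_root: Path) -> Path:
        """Copy a file into the sharded tree and return where it landed."""
        dest = shard_path(src, dest_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        return dest

For example, `store(Path("invoice.pdf"), Path("/srv/blobs"))` would place the file at something like /srv/blobs/da/39a3ee...pdf, and the application can recompute the same path from the hash whenever it needs to read the file back.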

Alternatively, perhaps the application can use object storage instead of regular files -- the S3 protocol or similar. Because it is a different API, it can drop the dentry semantics of a local file system entirely. Content-addressable keys also let you restructure or scale up the storage without changing how the application finds its blobs.
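A rough sketch of what content-addressed access through the S3 API could look like, using boto3; the bucket name and helper functions here are made up for illustration, and credentials are assumed to be configured in the environment:

    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def put_blob(data: bytes, bucket: str = "acme-documents") -> str:
        """Store a blob under a content-addressed key and return that key."""
        digest = hashlib.sha1(data).hexdigest()
        key = f"{digest[:2]}/{digest[2:]}"   # same prefix scheme as the directory layout
        s3.put_object(Bucket=bucket, Key=key, Body=data)
        return key

    def get_blob(key: str, bucket: str = "acme-documents") -> bytes:
        """Fetch a blob back by its content-addressed key."""
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

The key layout mirrors the directory sharding above, so migrating between a sharded NFS tree and an object store does not change how the application names its blobs.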

Restructuring and application changes are probably not what you wanted to hear, but there is no real way around it: a smarter way to store blobs is needed to scale that big.