What are the performance implications for millions of files in a modern file system?

Let's say we're using ext4 (with dir_index enabled) to host around 3M files (with an average of 750KB size) and we need to decide what folder scheme we're going to use.

In the first solution, we apply a hash function to the file and use two levels folder (being 1 character for the first level and 2 characters to second level): therefore being the filex.for hash equals to abcde1234, we'll store it on /path/a/bc/abcde1234-filex.for.

In the second solution, we apply a hash function to the file and use two levels folder (being 2 characters for the first level and 2 characters to second level): therefore being the filex.for hash equals to abcde1234, we'll store it on /path/ab/de/abcde1234-filex.for.

For the first solution we'll have the following scheme /path/[16 folders]/[256 folders] with an average of 732 files per folder (the last folder, where the file will reside).

While on the second solution we'll have /path/[256 folders]/[256 folders] with an average of 45 files per folder.

Considering we're going to write/unlink/read files (but mostly read) from this scheme a lot (basically the nginx caching system), does it maters, in a performance sense, if we chose one or other solution?

Also, what are the tools we could use to check/test this setup?


Solution 1:

The reason one would create this sort of directory structure is that filesystems must locate a file within a directory, and the larger the directory is, the slower that operation.

How much slower depends on the filesystem design.

The ext4 filesystem uses a B-tree to store directory entries. A lookup on this table is expected to take O(log n) time, which most of the time is less than the naive linear table that ext3 and previous filesystems used (and when it isn't, the directory is too small for it to really matter).

The XFS filesystem uses a B+tree instead. The advantage of this over a hash table or B-tree is that any node may have multiple children b, where in XFS b varies and can be as high as 254 (or 19 for the root node; and these numbers may be out of date). This gives you a time complexity of O(logb n), a vast improvement.

Either of these filesystems can handle tens of thousands of files in a single directory, with XFS being significantly faster than ext4 on a directory with the same number of inodes. But you probably don't want a single directory with 3M inodes, as even with a B+tree the lookup can take some time. This is what led to creating directories in this manner in the first place.

As for your proposed structures, the first option you gave is exactly what is shown in nginx examples. It will perform well on either filesystem, though XFS will still have a bit of an advantage. The second option may perform slightly better or slightly worse, but it will probably be pretty close, even on benchmarks.

Solution 2:

In my experience, one of the scaling factors is the size of the inodes given a hash-name partitioning strategy.

Both of your proposed options creates up to three inode entries for each created file. Also, 732 files will create an inode that is still less than the usual 16KB. To me, this means either option will perform the same.

I applaud you on your short hash; previous systems I've worked on have taken the sha1sum of the given file and spliced directories based on that string, a much harder problem.

Solution 3:

Certainly either option will help reduce the number of files in a directory to something that seems reasonable, for xfs or ext4 or whatever file system. It is not obvious which is better, would have to test to tell.

Benchmark with your application simulating something like the real workload is ideal. Otherwise, come up with something that simulates many small files specifically. Speaking of that, here's an open source one called smallfile. Its documentation references some other tools.

hdparm doing sustained I/O isn't as useful. It won't show the many small I/Os or giant directory entries associated with very many files.

Solution 4:

One of the issues is the way to scan the folder.

Imagine Java method which runs scan on the folder.

It will have to allocate large amount of memory and deallocate it in short period of time which is very heavy for the JVM.

The best way is to arrange the folder structure the way that each file is in dedicated folder e.g. year/month/day.

The way full scan is done is that for each folder there's one run of the function so JVM will exit the function, deallocate RAM and run it again on another folder.

This is just example but anyway having such huge folder makes no sense.