Max Files for a Directory in Linux File System, Best Performance
There isn't an easy way to answer the question, but take a look at things like:
- /usr/share/lib/terminfo/...
- CPAN authors/id/...
In both cases, with far fewer than a million entries, the designers split the directories into multiple levels to speed up access.
If you have a million entries and the file system does not have any search structure built into the directory handling code, then accessing a file is going to require the o/s to read about half the name + inode number entries in the directory for each file. Even if it is all in the buffer pool, that becomes a significant workload.
If you introduce a tiered naming system - both the examples base it on the first characters of the name:
terminfo/lib/a/ansi
id/J/JO/JOHNL
CPAN has two levels; for your 1 million files, I'd probably use two levels too.
There is some overhead in having the extra level(s) of directory.
These schemes assume you know the one name you are looking for - searching through all names is a different proposition.
modern filesystems (ext3-4, XFS, ReiserFS, and lots of others) can easily handle huge subdirectories. that means that any single operation takes comparable times no mather how many files are there. so far, so good.
But, there are lots of operations that count as 'many operations', and those will degrade after some point. The most obvious example is a simple ls
, which not only does a stat()
on each and every file, but also sorts them. in most cases, it result in a O(n (log n)^2) behavior.
Other pain point is wildcard matching. Usually it will be a O(n) behavior, with n being the total number of files, and not only the matching files. For example, if you store a JPEG and a GIF for every item, and want to get them both with item-xx.*
, it would take a long time, even if the item-xx
part fully identifies the item you want. (Yes, on SQL a LIKE 'item-xx.%'
would take advantage of an index; but i haven't seen any FS do that)
In short: a multi-million-items directory will perform well if you give complete and precise paths. if there's any possibility of asking it to complete the names, better go with a hierarchical structure.