max files per directory in ext4
Solution 1:
The ext3 and later filesystems support hashed B-tree directory indexing. This scales very well as long as the only operations you perform are add, delete, and access by name. However, I would still recommend breaking the directories down. Otherwise, you create a dangerous booby trap for tools (updatedb, ls, du, and so on) that perform other operations on directories, which can blow up if a directory has too many entries.
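As a rough sketch of what "breaking the directories down" can look like (Python; the bucket_path helper, the base directory, and the two-level, two-character bucket layout are all illustrative choices, not anything prescribed by ext4):

```python
import hashlib
import os

def bucket_path(base_dir: str, filename: str, levels: int = 2, width: int = 2) -> str:
    """Place `filename` under nested bucket directories derived from a hash
    of the name, so no single directory accumulates millions of entries."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    target_dir = os.path.join(base_dir, *parts)
    os.makedirs(target_dir, exist_ok=True)
    return os.path.join(target_dir, filename)

# bucket_path("/data", "report.txt") yields a path like /data/<xx>/<yy>/report.txt,
# where <xx> and <yy> depend on the hash of the filename.
```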
Solution 2:
The core of the problem is digging through the directory inode for the one file you want. Some filesystems do this better than others. Some scale close to the billions, but if you only have, say, 20K files, getting to those files is markedly faster. Also, large file counts create problems for certain tools and can make backup/restore a much harder problem as a result.
As it happens, I ran into the exact same problem in our own development (md5sum as filename, and scaling thereof). What I recommended to our developers was to chop the string into pieces. They went with groups of 4, but on the filesystem we were using at the time even that grouping proved problematic from a performance perspective, so they ended up splitting into groups of 3 for the first 6 triplets and leaving the rest as the filename in the terminal directory.
Group of 4: 4976/d70b/180c/6142/c617/d0c8/9d0b/bd2b.txt
Group of 3: 497/6d7/0b1/80c/614/2c6/17d0c89d0bbd2b.txt
This has the advantage of keeping directory sizes small, and since MD5sum is pretty random, it creates balanced directory trees. That last directory is unlikely to ever hold more than a few files. And it wasn't that hard to work into our code. We work with multi-million-file projects, so scaling was very important to us.
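For illustration, a minimal sketch of that splitting scheme in Python (the md5_to_path helper and the /store base directory are hypothetical; the group size and depth mirror the group-of-3 example above):

```python
import os

def md5_to_path(base_dir: str, md5_name: str, group: int = 3, depth: int = 6) -> str:
    """Split the leading characters of an md5 hex digest into fixed-size
    directory components; the remainder becomes the final filename."""
    prefix_parts = [md5_name[i * group:(i + 1) * group] for i in range(depth)]
    remainder = md5_name[group * depth:]
    return os.path.join(base_dir, *prefix_parts, remainder + ".txt")

# md5_to_path("/store", "4976d70b180c6142c617d0c89d0bbd2b")
#   -> "/store/497/6d7/0b1/80c/614/2c6/17d0c89d0bbd2b.txt"
```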
Solution 3:
Modern filesystems handle very large directories very well, even into the millions of files. But conventional tools do not. For example, listing such a large directory with "ls" takes quite a long time, since it normally reads the entire directory and sorts it (although you can use ls -f to avoid the sorting). It would not start showing files until they have all been read. Splitting the names helps in some cases, but not in all (for example, rsync replication may still need to collect the entire tree of names).
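As a rough sketch of that difference (Python, with hypothetical helper names; this is not how ls itself is implemented), streaming entries with os.scandir can start yielding names immediately, while a read-everything-then-sort approach must scan the whole directory first:

```python
import os

def stream_entries(path: str):
    """Yield names one at a time without reading the whole directory first
    (comparable in spirit to `ls -f`, which skips sorting)."""
    with os.scandir(path) as it:
        for entry in it:
            yield entry.name

def sorted_listing(path: str):
    """Read every entry, then sort, so nothing can be shown until the full
    directory has been scanned (comparable to a plain `ls`)."""
    return sorted(os.listdir(path))

# On a directory with millions of files, the first names from
# stream_entries() arrive almost immediately, while sorted_listing()
# returns nothing until the entire directory has been read.
```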