linux: accessing thousands of files in a hash of directories
I would like to know the most efficient way to concurrently access thousands of files of similar size on a modern Linux cluster of computers.
I am carrying out an indexing operation on each of these files, so 4 index files, each about 5-10x smaller than the data file, are produced next to the file being indexed.
Right now I am using a hierarchy of directories from ./00/00/00 to ./99/99/99, and I place one file at the end of each directory, like ./00/00/00/file000000.ext to ./99/99/99/file999999.ext.
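For illustration, a minimal sketch of how such a path could be constructed from a six-digit file number (the helper name here is hypothetical):

```python
import os

def path_for(n, root=".", ext=".ext"):
    # Hypothetical helper: map a six-digit file number to its nested path,
    # e.g. 123456 -> ./12/34/56/file123456.ext
    s = f"{n:06d}"
    return os.path.join(root, s[0:2], s[2:4], s[4:6], f"file{s}{ext}")

print(path_for(0))       # ./00/00/00/file000000.ext
print(path_for(999999))  # ./99/99/99/file999999.ext
```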
It seems to work better than having thousands of files in the same directory, but I would like to know if there is a better way of laying out the files to improve access.
Solution 1:
A common performance problem with large directories on ext[34] is that the filesystem hashes the directory entries and stores them in hash order. This allows resolving a specific name quickly, but effectively randomizes the order in which the names are listed. If you try to operate on all of the files in a directory by iterating over the entries in the order they are listed, you cause a lot of random IO, which is very slow. The workaround is to sort the directory listing by inode number and then loop over the files from lowest to highest inode number. This keeps your IO mostly sequential.
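A minimal sketch of that workaround, assuming the files sit in one large directory (the function name is hypothetical):

```python
import os

def files_in_inode_order(directory):
    # Workaround sketch: list regular files and sort them by inode number,
    # so that reading them back hits the disk mostly sequentially.
    entries = [e for e in os.scandir(directory) if e.is_file(follow_symlinks=False)]
    entries.sort(key=lambda e: e.inode())
    return [e.path for e in entries]

for path in files_in_inode_order("."):
    with open(path, "rb") as f:
        f.read()  # do the per-file work (e.g. indexing) here
```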
Solution 2:
A commonly used scheme is to rename each file to its hash value, keeping the extension, and to use the first characters of the hash to distribute the files into different folders.
For example:
md5(test.jpg) gives you "13edbb5ae35af8cbbe3842d6a230d279"
Your file would be named "13edbb5ae35af8cbbe3842d6a230d279.jpg" and stored as ./13/ed/bb/5ae35af8cbbe3842d6a230d279.jpg. That way, given a large number of files, you should get a good distribution of files per folder.
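A minimal sketch of this scheme, assuming MD5 over the file contents and a 2+2+2 character split for the directory levels (the helper names are hypothetical):

```python
import hashlib
import os
import shutil

def hashed_path(src, root="."):
    # Compute the MD5 of the file contents and split the first six hex
    # characters into three directory levels, keeping the extension:
    # "13edbb5ae35af8cbbe3842d6a230d279" + ".jpg"
    #   -> <root>/13/ed/bb/5ae35af8cbbe3842d6a230d279.jpg
    h = hashlib.md5()
    with open(src, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    ext = os.path.splitext(src)[1]
    return os.path.join(root, digest[0:2], digest[2:4], digest[4:6], digest[6:] + ext)

def store(src, root="."):
    # Copy the file into its hash-derived location, creating directories as needed.
    dest = hashed_path(src, root)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(src, dest)
    return dest
```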
You end up with a tree similar to yours but lighter (metadata-wise), as you only have to store the original filename and its hash (the path being constructed from the hash).
As a side effect (which must be taken into account during development), you automatically gain file-based deduplication.
In addition, if you generate the hash before storing the file, you also get error checking for free. You could, for example, write a small cron job to check the integrity of your backups this way.
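A hedged sketch of such a check, assuming the layout above: recompute each file's MD5 and compare it against the hash encoded in its path (again, the function name is hypothetical):

```python
import hashlib
import os

def verify_tree(root="."):
    # Walk the hash-named tree and flag files whose contents no longer match
    # the MD5 encoded in their path (directory parts + filename stem).
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            parts = os.path.relpath(path, root).split(os.sep)
            expected = "".join(parts[:-1]) + os.path.splitext(parts[-1])[0]
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            if h.hexdigest() != expected:
                print("mismatch:", path)
```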