Which block size for millions of small files?
I have 2 × 4 TB disks in hardware RAID1 (possibly an LSI MegaRAID) on Debian Wheezy. The physical block size is 4 kB. I'm going to store 150-200 million small files (between 3 and 10 kB each). I'm not asking about performance, but about the best filesystem and block size for saving storage. I copied an 8200-byte file onto an ext4 filesystem with a 4 kB block size, and it took 32 kB of disk!? Is journaling the reason for that? So what options are there to save the most storage for such small files?
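For reference, here's one way to check the actual allocation independently of du (a quick Python sketch; on Linux, st_blocks is reported in 512-byte units regardless of the filesystem block size):

```python
#!/usr/bin/env python3
"""Compare a file's apparent size with the space actually allocated on disk."""
import os
import sys

def disk_usage(path: str) -> None:
    st = os.stat(path)
    apparent = st.st_size            # bytes the file claims to hold
    allocated = st.st_blocks * 512   # bytes actually reserved on disk
    ratio = allocated / apparent if apparent else float("inf")
    print(f"{path}: apparent={apparent} B, allocated={allocated} B ({ratio:.1f}x)")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        disk_usage(p)
```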
If I were in that situation, I'd be looking at a database that can store all the data in a single file with a compact, offset-based index, rather than as millions of separate files. Ideally one with a FUSE driver available, so you can still interact with the contents as files when necessary, without them actually being separate files on disk.
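As a rough illustration of that idea, here's a minimal sketch using SQLite as the single-file container; the schema and function names here are just illustrative, not a specific product recommendation:

```python
#!/usr/bin/env python3
"""Sketch of the 'one big file with an index' idea, using SQLite as the container."""
import sqlite3

def open_store(path: str) -> sqlite3.Connection:
    con = sqlite3.connect(path)
    # One row per logical file; the primary-key B-tree is the index.
    con.execute("""CREATE TABLE IF NOT EXISTS files (
                       name TEXT PRIMARY KEY,
                       data BLOB NOT NULL)""")
    return con

def put(con: sqlite3.Connection, name: str, data: bytes) -> None:
    con.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (name, data))
    con.commit()

def get(con: sqlite3.Connection, name: str) -> bytes:
    row = con.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise FileNotFoundError(name)
    return row[0]

if __name__ == "__main__":
    con = open_store("smallfiles.db")
    put(con, "example/0001", b"x" * 8200)  # an 8200-byte payload, like the question's
    print(len(get(con, "example/0001")), "bytes round-tripped")
```

Because rows get packed into shared pages, the per-file slack should be far smaller than a whole filesystem block, though you'd want to verify that at your scale before committing to it.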
Alternatively, you could look at, say, the 60th-70th percentile of file sizes, and try to fit files up to that size directly into the filesystem's tree nodes, rather than into separate blocks on disk (see the sketch below for one way to measure those percentiles). Storing 10 kB in each node is probably a big ask, but if you could get 60-70% of the files in there, that would probably be a huge win.
Only certain filesystems can do that at all (ReiserFS, with its tail packing, is one), and whether the data will actually fit in the tree depends on where that percentile falls; you may be able to tune it. For the files that don't fit, try to make each one fit into a single block.
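To get those percentiles from a sample of your real data, something like this nearest-rank sketch would do (the root path is whatever sample directory you have on hand):

```python
#!/usr/bin/env python3
"""Walk a tree and report file-size percentiles (nearest-rank method)."""
import os
import sys

def file_sizes(root: str):
    for dirpath, _dirs, names in os.walk(root):
        for n in names:
            try:
                yield os.path.getsize(os.path.join(dirpath, n))
            except OSError:
                pass  # skip files that vanish mid-walk

if __name__ == "__main__":
    sizes = sorted(file_sizes(sys.argv[1] if len(sys.argv) > 1 else "."))
    if not sizes:
        sys.exit("no files found")
    for pct in (50, 60, 70, 90):
        idx = min(len(sizes) - 1, int(len(sizes) * pct / 100))
        print(f"{pct}th percentile: {sizes[idx]} bytes")
```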
And don't worry about journals; they have an upper size limit anyway, so the journal isn't what's multiplying your per-file overhead.