How do I most efficiently store and serve 1,000,000+ small gzipped files on a Linux web server?
I have large static content that I have to deliver via a Linux-based web server. It is a set of over one million small gzipped files. 90% of the files are less than 1K and the rest are at most 50K. In the future, this could grow to over 10 million gzipped files.
Should I put this content in a file structure or should I consider putting all this content in a database? If it is in a file structure, can I use large directories or should I consider smaller directories?
I was told that a file structure would be faster for delivery, but on the other hand, I know the files will take up a lot of space on disk, since file system blocks will be larger than 1K.
What is the best strategy regarding delivery performance?
UPDATE
For the record, I performed a test under Windows 7 with half a million files:
I would guess that an FS structure would be faster, but you will need a good directory layout to avoid having directories with a very large number of files.
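For example, a common way to get such a layout is to fan the files out into subdirectories derived from a hash of the name. This is only a sketch, assuming the files are addressed by name; the two-level 256x256 layout and the `/var/www/static` root are arbitrary choices to tune for your case:

```python
import hashlib
import os

def hashed_path(root, name):
    """Map a file name to a two-level directory bucket based on its hash.

    With 256 x 256 buckets, one million files average roughly 15 files
    per directory; ten million average roughly 150.
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], name)

# Example: the path is deterministic, so the same call works for store and lookup.
path = hashed_path("/var/www/static", "article-123456.json.gz")
os.makedirs(os.path.dirname(path), exist_ok=True)
```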
I wouldn't worry too much about lost disk space. As an example, at a 16K block size you will lose about 15GB of space in the worst case, where every single file fills only a fraction of its block. With today's disk sizes, that's nothing, and you can adapt the parameters of your file system to your specific needs.
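To make that arithmetic explicit (a back-of-the-envelope sketch using the numbers above):

```python
# Worst case: every one of 1,000,000 files is ~1K and sits alone in a 16K block,
# wasting roughly 15K of slack per file.
files = 1_000_000
block_size = 16 * 1024      # 16K filesystem block
file_size = 1 * 1024        # ~1K of data per file
wasted = files * (block_size - file_size)
print(wasted / 1024**3)     # ~14.3 GiB, i.e. roughly the 15GB quoted above
```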
If you choose the file structure option, one thing you can do to improve disk I/O performance, at least to some degree, is to mount the partition with noatime and nodiratime, unless you actually need access times. They are rarely important, so I recommend disabling them. You might also consider a solid-state drive.
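For instance, an /etc/fstab entry along these lines (just a sketch; the device, mount point, and filesystem are placeholders to adjust for your system):

```
/dev/sdb1  /var/www/static  ext4  defaults,noatime,nodiratime  0  2
```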
I think the correct answer here depends on how the files will be indexed... what determines when a given file is selected for delivery.
If you're already making a database query to determine the file name, you may very well find that you're better off keeping the file right there in the db record; you may get the best results from tweaking some paging settings in your database of choice and then storing the files in the db (e.g. larger pages to account for all the blob records); or you may find that you're still better off using the file system.
The database option has a slightly better chance of working out because, with a million records, it's probable that not every file is equally likely to be queried. If you're in a situation where one file may be queried several times in a row, or nearly in a row, the database can act as a de facto cache for recently retrieved files, in which case you'll often have your file result already loaded into memory. You may need to carefully tune the internals of your database engine to get the behavior you want.
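As a rough sketch of the database variant (I'm using SQLite purely for illustration; no particular engine is implied above), storing the already-gzipped bytes as blobs keyed by name could look like this:

```python
import sqlite3

# Hypothetical schema: one row per gzipped file, keyed by name.
conn = sqlite3.connect("static_files.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, body BLOB NOT NULL)"
)

def put_file(name, gzipped_bytes):
    # Store (or replace) the pre-gzipped content for a given name.
    conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (name, gzipped_bytes))
    conn.commit()

def get_file(name):
    # Returns the gzipped bytes, ready to serve with Content-Encoding: gzip.
    row = conn.execute("SELECT body FROM files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```

The page cache of whatever engine you choose is what provides the "de facto cache" effect described above; a server-class database gives you more knobs (page size, buffer pool) to tune for blob-heavy workloads.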
But the main thing to take away from my answer is that you don't really know what will work best until you try it with some representative test data and measure the results.
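A minimal way to do that measurement (a sketch, assuming you've generated representative test files under a placeholder root in whatever layout you plan to use):

```python
import random
import time
from pathlib import Path

def benchmark_reads(root, samples=10_000):
    """Time random reads over a representative set of .gz files under `root`."""
    files = list(Path(root).rglob("*.gz"))
    picks = random.choices(files, k=samples)
    start = time.perf_counter()
    for path in picks:
        path.read_bytes()
    elapsed = time.perf_counter() - start
    return samples / elapsed  # reads per second

# Example (placeholder path); run it both cold and warm to see page-cache effects.
# print(benchmark_reads("/var/www/static"))
```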
With modern filesystems it shouldn't be much of a problem. I've tested XFS with 1 billion files in the same directory, and I'm pretty sure ext4 will do fine too (as long as the filesystem itself is not too big). Have enough memory to cache the directory entries; bigger processor cache will help a lot too.