Linux Filesystems

Solution 1:

You're running into a well-known issue. While there are filesystems that will accommodate millions of files (XFS and ReiserFS on Linux, and NTFS on Windows), they still have to sift through that stack of filenames to find the one you want. Just because a filesystem accommodates that many files doesn't mean access will be quick. I have requested file properties on a Windows server with just tens of thousands of files on it, and that was pretty much a "go to lunch and come back" deal. I've also tried to list a directory with ls and found that the 20,000-odd files in it took about two minutes to come back on a busy server (the filesystem was Ext3).

Fortunately, there is a solution, although it might be a bit different from what you're expecting.

Use additional subdirectories.

This is a well-known strategy and has been used successfully in a variety of programs. For instance, Squid uses layers of subdirectories to deal with exactly this problem, for the same reason: hundreds of thousands of files that need to be accessed quickly. With just one additional layer of directories, it can manage millions of files easily.

It's also a lot more common on web pages than you might expect. Every time you see a URL like this:

http://www.somelargenewssite.com/articles/09/08/a4/gibberish-page-key-abc123.html

...it's accomplishing the same thing. It's not about organizing articles by year and month; it's about improving page-load performance for the client by reducing the time the web server spends looking up the page.

If at all possible, avoid 100,000 files per directory; aim for 1,000 to 10,000 instead. If you're unsure how to accomplish this, just take the first letter of the filename and make that an additional directory, e.g.

http://mysite.com/subpage/abcdefg1234567.php

becomes

http://mysite.com/subpage/a/abcdefg1234567.php

If that doesn't bring your file counts down far enough, you can add the 2nd letter, or the 3rd, and so on, until each directory holds a manageable number of files.

http://mysite.com/subpage/a/b/c/abcdefg1234567.php

This process requires minimal coding on your part, is easily accommodated by the filenames alone, and will improve your access times regardless of the filesystem you use.
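
If you're doing this in code, a minimal sketch of the scheme looks something like the following (Python; the function name and the depth parameter are my own, not from any particular library):

    import os

    def sharded_path(base_dir, filename, depth=1):
        # Use the first `depth` characters of the filename as
        # directory levels: subpage/a/abcdefg1234567.php
        levels = list(filename[:depth])
        directory = os.path.join(base_dir, *levels)
        os.makedirs(directory, exist_ok=True)  # create shard dirs on first use
        return os.path.join(directory, filename)

    # "abcdefg1234567.php" lands in subpage/a/abcdefg1234567.php
    print(sharded_path("subpage", "abcdefg1234567.php"))
    # ...and with depth=3 in subpage/a/b/c/abcdefg1234567.php
    print(sharded_path("subpage", "abcdefg1234567.php", depth=3))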

Solution 2:

From Novell's web site:

Another way to overcome the 32,000-subdirectory limit of the EXT3 file system is to increase the directories' maximum i-node count to 65,500 in the EXT3 kernel module, then recompile and build a new kernel from the existing kernel sources.

That being said, use a database.
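
If you do take that advice, even an embedded database such as SQLite sidesteps the directory-scan problem entirely, since lookups go through the database's own index rather than the filesystem. A rough sketch, with an illustrative schema and helper names of my own choosing:

    import sqlite3

    conn = sqlite3.connect("files.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)"
    )

    def put(name, data):
        # The PRIMARY KEY gives an indexed lookup; no directory scan needed.
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (name, data))
        conn.commit()

    def get(name):
        row = conn.execute(
            "SELECT data FROM files WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None

    put("abcdefg1234567.php", b"<?php ... ?>")
    print(get("abcdefg1234567.php"))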

Solution 3:

You need to use a filesystem that indexes directories with something like a B+tree; examples include XFS and JFS. That said, no filesystem is good at storing that many files in a single directory, so if you control the code that writes into the directory, you would be much better off using a hashing scheme, as sketched below.
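
A minimal sketch of such a hashing scheme, assuming Python and a two-hex-digits-per-level layout (a common convention, not the only option):

    import hashlib
    import os

    def hashed_path(base_dir, filename, levels=2):
        # Hash the name and use two hex digits per directory level,
        # e.g. storage/3f/a2/abcdefg1234567.php
        digest = hashlib.md5(filename.encode()).hexdigest()
        parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
        directory = os.path.join(base_dir, *parts)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, filename)

    print(hashed_path("storage", "abcdefg1234567.php"))

The advantage over first-letter sharding is that the hash spreads files evenly across 256 subdirectories per level, whereas raw filenames tend to skew toward a few common leading characters.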