How does the number of subdirectories impact drive read/write performance on Linux?

I've got an EXT3-formatted drive on a Linux CentOS server. This is a web app data drive and contains a directory for every user account (there are 25,000 users). Each folder contains the files that user has uploaded. Overall, this drive has roughly 250GB of data on it.

Does structuring the drive with all these directories impact drive read/write performance? Does it impact some other performance aspect I'm not aware of?

Is there anything inherently wrong or bad with structuring things this way? Perhaps just the wrong choice of filesystem?

I've recently tried merging two data drives and realized that EXT3 is limited to 32,000 subdirectories. This got me wondering why. It seems silly that I built it this way, considering each file has a unique id that corresponds to an id in the database. Alas ...


Solution 1:

It's easy to test the options for yourself in your environment and compare the results. Yes, there is a negative impact on performance as the number of directories increases. Yes, other filesystems can help get around those barriers or reduce the impact.

The XFS filesystem is better for this type of directory structure. ext4 is probably just fine nowadays. Access and operations on the directory will simply slow down as the number of subdirectories and files increases. This is very pronounced under ext3 and not so much on XFS.
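
If you want to measure this yourself, a rough timing harness like the following can serve as a starting point. It's only a sketch: the path /mnt/data/dirtest is a placeholder and the batch size is arbitrary; run it on each filesystem you want to compare.

# Sketch: time file creation in a single directory as it grows.
# /mnt/data/dirtest is a placeholder; point it at the drive under test.
DIR=${1:-/mnt/data/dirtest}
mkdir -p "$DIR"
for batch in 1 2 3 4; do
    start=$(date +%s)
    for i in $(seq 1 10000); do
        : > "$DIR/file-$batch-$i"      # create an empty file
    done
    end=$(date +%s)
    echo "batch $batch: 10000 files in $((end - start))s ($((batch * 10000)) total)"
done

Run it once on ext3 and once on XFS (or ext4) and compare how much each batch slows down as the directory fills up.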

Solution 2:

The answer isn't as simple as the choice of filesystem. Sane filesystems stopped using linear lists for directories long ago, meaning that the number of entries in a directory doesn't affect file access time....

except when it does.

In fact, each operation stays fast and efficient no matter the number of entries, but some tasks involve a growing number of operations. Obviously, doing a simple ls takes a long time, and you don't see a thing until all inodes have been read and sorted. Doing ls -U (unsorted) helps a little because you can see it's not dead, but it doesn't reduce the time perceptibly. Less obvious is that any wildcard expansion has to check each and every filename, and it seems that in most cases the whole inode has to be read too.

In short: if you can be positively sure that no application (including shell access) will ever use any wildcard, then you can have huge directories without any remorse. But if there might be some wildcards lurking in the code, better keep directories below a thousand entries each.
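
One common way to keep directories that small, since the question mentions each file already has a unique database id, is to fan files out into subdirectories derived from that id so no single directory ever holds more than a few hundred entries. This is only an illustration of the idea, not part of the answer above; the store_path helper, the /srv/uploads root and the two-level split are all assumptions.

# Hypothetical helper: derive a two-level subdirectory path from a numeric
# file id, e.g. id 1234567 -> /srv/uploads/67/45/1234567
store_path() {
    id=$1
    base=${2:-/srv/uploads}                    # assumed root of the data drive
    sub1=$(printf '%02d' $((id % 100)))        # last two digits of the id
    sub2=$(printf '%02d' $((id / 100 % 100)))  # next two digits
    echo "$base/$sub1/$sub2/$id"
}

# Usage:
#   path=$(store_path 1234567)                 # -> /srv/uploads/67/45/1234567
#   mkdir -p "$(dirname "$path")" && mv upload.tmp "$path"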

edit:

All modern filesystems use good data structures for big directories, so a single operation that has to find the inode of a specific file will be quite fast even on humongous directories.

But most applications don't do just single operations. Most of them will do either a full directory listing or a wildcard match. Those are slow no matter what, because they involve reading all entries.

For example: let's say you have a directory with a million files called 'foo-000000.txt' through 'foo-999999.txt' and a single 'natalieportman.jpeg'. These will be fast:

  • ls -l foo-123456.txt
  • open "foo-123456.txt"
  • delete "foo-123456.txt"
  • create "bar-000000.txt"
  • open "natalieportman.jpeg"
  • create "big_report.pdf"

These will fail, but they fail fast too:

  • ls -l bar-654321.txt
  • open bar-654321.txt
  • delete bar-654321.txt

These will be slow, even if they return very few results; even the ones that fail do so only after scanning all entries:

  • ls
  • ls foo-1234*.txt
  • delete *.jpeg
  • move natalie* /home/emptydir/
  • move *.tiff /home/seriousphotos/
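
If you want to reproduce the difference, here is a scaled-down sketch of the example above (100,000 files instead of a million, under /tmp/bigdir; both of those choices are my own, not from the answer):

# Build a big flat directory, then compare a direct lookup with wildcard
# and full-listing operations.
DIR=/tmp/bigdir                                # assumed scratch location
mkdir -p "$DIR"
for i in $(seq -w 0 99999); do : > "$DIR/foo-$i.txt"; done
: > "$DIR/natalieportman.jpeg"

time stat "$DIR/foo-12345.txt" > /dev/null     # direct lookup: fast
time ls "$DIR"/foo-1234*.txt > /dev/null       # wildcard: reads every entry
time ls "$DIR" > /dev/null                     # full sorted listing: slowest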

Solution 3:

First make sure that the ext3 partition has the dir_index flag set.

sudo dumpe2fs /dev/sdaX | grep --color dir_index

If it is missing, you can enable it. You need to unmount the filesystem, then run:

sudo tune2fs -O dir_index /dev/sdaX
sudo e2fsck -Df /dev/sdaX

Then mount the filesystem.
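
Put together, the whole procedure might look like this. It's a sketch: /dev/sdaX is the placeholder device from above, and /data stands in for wherever the filesystem is normally mounted (the final mount assumes an /etc/fstab entry).

DEV=/dev/sdaX      # placeholder; substitute your actual device
MNT=/data          # assumption: the filesystem's usual mount point

if ! sudo dumpe2fs "$DEV" | grep -q dir_index; then
    sudo umount "$MNT"
    sudo tune2fs -O dir_index "$DEV"   # enable hashed (HTree) directory indexing
    sudo e2fsck -Df "$DEV"             # -D reindexes existing directories, -f forces the check
    sudo mount "$MNT"                  # relies on the /etc/fstab entry for $MNT
fi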