Storing a million images in the filesystem
I have a project that will generate a huge number of images, around 1,000,000 to start. They are not large images, so I will store them all on one machine for now.
How do you recommend storing these images efficiently? (Currently on an NTFS file system.)
I am considering a naming scheme: to start, all the images will have incremental names from 1 up. I hope this will help me sort them later if needed, and I plan to spread them across different folders.
Which would be a better naming scheme:
a/b/c/0 ... z/z/z/999
or
a/b/c/000 ... z/z/z/999
Any ideas on this?
I'd recommend using a regular file system instead of a database. Using the file system is easier than a database: you can use normal tools to access the files, file systems are designed for this kind of usage, and so on. NTFS should work just fine as a storage system.
Do not store the actual path in the database. It is better to store the image's sequence number in the database and have a function that can generate the path from the sequence number, e.g.:
File path = generatePathFromSequenceNumber(sequenceNumber);
It is easier to handle things if you need to change the directory structure somehow. Maybe you need to move the images to a different location, maybe you run out of space and start storing some of the images on disk A and some on disk B, etc. It is easier to change one function than to change paths in the database.
I would use this kind of algorithm for generating the directory structure (a short sketch of the function follows the list):
- First pad your sequence number with leading zeroes until you have at least a 12-digit string. This is the name for your file. You may want to add a suffix:
  - 12345 -> 000000012345.jpg
- Then split the string into 2- or 3-character blocks, where each block denotes a directory level. Have a fixed number of directory levels (for example 3):
  - 000000012345 -> 000/000/012
- Store the file under the generated directory:
  - Thus the full path and file name for the file with sequence id 123 is 000/000/000/000000000123.jpg
  - For the file with sequence id 12345678901234, the path would be 123/456/789/12345678901234.jpg
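As a rough sketch (in Java, since the snippet above is Java-like; the method name generatePathFromSequenceNumber and the .jpg suffix are just the placeholders used above), the padding and splitting could look like this:

    static String generatePathFromSequenceNumber(long sequenceNumber) {
        // Pad to at least 12 digits; the padded string is the file name.
        String name = String.format("%012d", sequenceNumber);
        // The first nine digits become three 3-character directory levels.
        String dir = name.substring(0, 3) + "/"
                   + name.substring(3, 6) + "/"
                   + name.substring(6, 9);
        return dir + "/" + name + ".jpg";
    }

For example, generatePathFromSequenceNumber(123) returns 000/000/000/000000000123.jpg, matching the example above.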
Some things to consider about directory structures and file storage:
- The above algorithm gives you a system where every leaf directory has a maximum of 1000 files (as long as you have fewer than 1,000,000,000,000 files in total).
- There may be limits on how many files and subdirectories a directory can contain; for example, the ext3 file system on Linux has a limit of 31,998 subdirectories per directory.
- Normal tools (WinZip, Windows Explorer, command line, bash shell, etc.) may not work very well if you have a large number of files per directory (> 1000).
- The directory structure itself will take some disk space, so you do not want too many directories.
- With the above structure you can always find the correct path for an image file just by looking at the file name, if you happen to mess up your directory structure.
- If you need to access files from several machines, consider sharing the files via a network file system.
- The above directory structure will not work well if you delete a lot of files, since it leaves "holes" in the directory structure. But since you are not deleting any files, it should be OK.
I'm going to put my 2 cents worth in on a piece of negative advice: Don't go with a database.
I've been working with image-storing databases for years: large (1 MB to 1 GB) files, often changed, multiple versions of the file, accessed reasonably often. The database issues you run into when storing large files are extremely tedious to deal with, write and transaction issues are knotty, and you run into locking problems that can cause major train wrecks. I have more practice writing dbcc scripts and restoring tables from backups than any normal person should ever have.
Most of the newer systems I've worked with have pushed the file storage to the file system, and relied on databases for nothing more than indexing. File systems are designed to take that sort of abuse, they're much easier to expand, and you seldom lose the whole file system if one entry gets corrupted.
Ideally, you should run some tests on random access times for various structures, as your specific hard drive setup, caching, available memory, etc. can change these results.
Assuming you have control over the filenames, I would partition them at the level of 1000s per directory. The more directory levels you add, the more inodes you burn, so there's a push-pull here.
E.g.,
/root/[0-99]/[0-99]/filename
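As a rough sketch (Java, assuming incremental numeric ids; pathFor and the constants are illustrative only), the mapping could be:

    static String pathFor(long id) {
        long bucket = id / 1000;            // roughly 1000 files share a leaf directory
        long level1 = (bucket / 100) % 100; // top-level folder, 0-99
        long level2 = bucket % 100;         // second-level folder, 0-99
        return String.format("/root/%02d/%02d/%d.jpg", level1, level2, id);
    }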
Note, http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx has more details on NTFS setup. In particular, "If you use large numbers of files in an NTFS folder (300,000 or more), disable short-file name generation for better performance, and especially if the first six characters of the long file names are similar."
You should also look into disabling filesystem features you don't need (e.g., last access time). http://www.pctools.com/guides/registry/detail/50/
I think most sites that have to deal with this use a hash of some sort to make sure that the files get evenly distributed in the folders.
So say you have a hash of a file that is something like this: 515d7eab9c29349e0cde90381ee8f810
You could store it in the following location, using however many levels of depth you need to keep the number of files in each folder low:
\51\5d\7e\ab\9c\29\349e0cde90381ee8f810.jpg
I've seen this approach taken many times. You still need a database to map these file hashes to a human-readable name and whatever other metadata you need to store. But this approach scales pretty well, because you can start to distribute the hash address space between multiple computers and/or storage pools, etc.
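A minimal sketch of that idea in Java (hashedPath is a hypothetical helper; MD5 and the number of directory levels are just examples, and you could hash the file contents instead of the name):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    static String hashedPath(String originalName) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(originalName.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // hex-encode the digest
        }
        // The first few hex pairs become directories, the rest becomes the file name.
        return hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/"
             + hex.substring(4, 6) + "/" + hex.substring(6) + ".jpg";
    }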
Whatever you do, don't store them all in one directory.
Depending on the distribution of the names of these images, you could create a directory structure with single-letter top-level folders, each containing another set of subfolders for the second letter of the image names, and so on.
So:
Folder img\a\b\c\d\e\f\g\
would contain the images starting with 'abcdefg' and so on.
You can introduce whatever depth is appropriate for your needs.
The great thing about this solution is that the directory structure effectively acts like a hashtable/dictionary. Given an image file name, you will know its directory, and given a directory, you will know a subset of the images that go there.
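A minimal sketch in Java (letterPath is a hypothetical helper; the depth of 3 is arbitrary and the code assumes names are at least that long):

    static String letterPath(String imageName) {
        StringBuilder path = new StringBuilder("img");
        int depth = Math.min(3, imageName.length()); // pick whatever depth you need
        for (int i = 0; i < depth; i++) {
            path.append('/').append(imageName.charAt(i));
        }
        return path + "/" + imageName;
    }

For example, letterPath("abcdefg.jpg") would return img/a/b/c/abcdefg.jpg.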