Storing a large number of images

We had a similar problem in the past and found a nice solution:

  • Give each image a unique GUID.
  • Create a database record for each image containing the name, location, GUID, and the possible locations of sub-images (thumbnails, reduced size, etc.).
  • Use the first (one or two) characters of the GUID to determine the top-level folder (see the sketch below).
  • If the folders contain too many files, split again. Update the references and you are ready to go.
  • If the number of files or the access load gets too high, you can spread the folders over different file servers.

In our experience, the GUIDs give a more or less uniform distribution over the folders, and it worked like a charm.
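As a rough illustration of the bucketing step, here is a minimal PHP sketch (the repository root and the two-character prefix are assumptions; adjust to your own layout and split depth):

// Sketch: derive a storage path from an image GUID.
// Assumes a repository root of /images and a split on the first two characters.
function guidToPath($guid, $root = '/images') {
    $guid = strtolower(str_replace('-', '', $guid)); // normalize the GUID
    $topLevel = substr($guid, 0, 2);                 // first two characters pick the top-level folder
    return $root . '/' . $topLevel . '/' . $guid;
}

// guidToPath('3F2504E0-4F89-11D3-9A0C-0305E82C3301')
// => /images/3f/3f2504e04f8911d39a0c0305e82c3301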

Links which might help to generate a unique ID:

  • http://en.wikipedia.org/wiki/Universally_Unique_Identifier
  • http://en.wikipedia.org/wiki/Sha1

I worked on an Electronic Document Management system a few years ago, and we did pretty much what Gamecat and wic suggested.

That is, assign each image a unique ID, and use that to derive a relative path to the image file. We used MOD arithmetic similar to what wic suggested, but we allowed 1024 folders/files at each level, with 3 levels, so we could support roughly 1G (1024³ ≈ 1.07 billion) files.

We stripped the extension off the files, however. The DB records contained the MIME type, so the extension was not needed.
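A minimal PHP sketch of that kind of derivation, assuming "3 levels" means two directory levels plus the file name (the function name and zero-padding are made up here; the file is stored without an extension since the MIME type lives in the DB record):

// Sketch: derive a relative path from a numeric image ID,
// with up to 1024 entries per level (1024^3 ~= 1G files total).
function idToRelativePath($id) {
    $level1 = intdiv($id, 1024 * 1024) % 1024; // top-level directory
    $level2 = intdiv($id, 1024) % 1024;        // second-level directory
    $file   = $id % 1024;                      // file name, no extension
    return sprintf('%04d/%04d/%04d', $level1, $level2, $file);
}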

I would not recommend storing the full URL in the DB record, only the image ID. If you store the URL, you can't move or restructure your storage without converting your DB. A relative URL would be OK, since that way you can at least move the image repository around, but you'll get more flexibility if you just store the ID and derive the URL.

Also, I would not recommend allowing direct references to your image files from the web. Instead, provide a URL to a server-side program (e.g., a Java Servlet), with the image ID supplied in the URL query string (http://url.com/GetImage?imageID=1234).

The servlet can use that ID to look up the DB record, determine the MIME type, derive the actual location, check security restrictions, do logging, etc.
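A rough PHP equivalent of that handler (the answer mentions a Java Servlet, but the pattern is the same; lookupImageRecord() and idToRelativePath() are hypothetical helpers standing in for the DB lookup and path derivation):

// Sketch of a GetImage endpoint: look up the record by ID, then stream the file.
$imageId = (int) ($_GET['imageID'] ?? 0);

$record = lookupImageRecord($imageId);   // hypothetical DB lookup: MIME type, owner, etc.
if ($record === null) {
    http_response_code(404);
    exit;
}

// Security checks and logging would go here.

$path = '/var/image-repository/' . idToRelativePath($imageId); // derive the actual location
header('Content-Type: ' . $record['mime_type']);
header('Content-Length: ' . filesize($path));
readfile($path);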


I usually just use the numerical database ID (auto_increment) and then use the modulo (%) operator to figure out where to put the file. Simple and scalable. For instance, the path to the image with ID 12345 could be created like this:

12345 % 100 = 45
12345 % 1000 = 345

Ends up in:

/home/joe/images/345/45/12345.png

Or something like that.
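As a sketch in PHP (the base directory is just the example path above):

// Sketch: build the storage path from an auto_increment ID using modulo buckets.
function modPath($id, $base = '/home/joe/images') {
    $outer = $id % 1000; // e.g. 12345 % 1000 = 345
    $inner = $id % 100;  // e.g. 12345 % 100  = 45
    return sprintf('%s/%d/%d/%d.png', $base, $outer, $inner, $id);
}

// modPath(12345) => /home/joe/images/345/45/12345.png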

If you're using Linux and the ext3 filesystem, be aware that there are limits to the number of subdirectories a single directory can hold. The limit is about 32000 subdirectories, so you should always strive to keep the number of directories per level well below that.


"I know is impractical to have all of them sitting at the same directory in the server as it would slow access to a crawl."

This is an assumption.

I have designed systems where we had millions of files stored flat in one directory, and it worked great. It's also the easiest system to program. Most server filesystems support this without a problem (although you'd have to check which one you were using).

http://www.databasesandlife.com/flat-directories/


When saving files associated with auto_increment IDs, I use something like the following, which creates three directory levels, each containing up to 1000 directories, with up to 100 files in each third-level directory. That supports ~100 billion files.

If $id = 99532455444, the following returns /995/324/554/44:

function getFileDirectory($id) {
    // Up to 1000 directories at each of the three levels, 100 files per leaf directory.
    $level1 = intdiv($id, 100000000) % 1000; // hundred-millions block
    $level2 = intdiv($id, 100000) % 1000;    // hundred-thousands block
    $level3 = intdiv($id, 100) % 1000;       // hundreds block
    $file   = $id % 100;                     // last two digits name the file

    return '/' . sprintf("%03d", $level1)
         . '/' . sprintf("%03d", $level2)
         . '/' . sprintf("%03d", $level3)
         . '/' . $file;
}
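
For example, called with the ID from the text above (IDs this large need a 64-bit PHP build):

echo getFileDirectory(99532455444); // prints /995/324/554/44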