Storing and backing up 10 million files on Linux

I run a website where about 10 million files (book covers) are stored in 3 levels of subdirectories, with each level ranging over [0-f]:

0/0/0/
0/0/1/
...
f/f/f/

This works out to around 2,400 files per directory, which makes retrieving a single file very fast. It is also the practice suggested by many other questions.

However, when I need to back up these files, it takes many days just to traverse the 4k directories holding the 10m files.

So I'm wondering if I could store these files in a container (or in 4k containers), each of which would act exactly like a filesystem (some kind of mounted ext3/4 container?). I guess this would be almost as efficient as accessing a file directly in the filesystem, and it would have the great advantage of being very efficient to copy to another server.
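For instance, something roughly like this is what I have in mind (the size and paths below are just placeholders):

    # create a sparse 50 GB container file and format it as ext4
    truncate -s 50G /data/covers-0.img
    mkfs.ext4 -F /data/covers-0.img

    # mount it over loopback and serve files from /mnt/covers-0
    mkdir -p /mnt/covers-0
    mount -o loop /data/covers-0.img /mnt/covers-0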

Any suggestions on how to do this best? Or any viable alternatives (NoSQL, ...)?


Options for quickly accessing and backing up millions of files

Borrow from people with similar problems

This sounds very much like an easier version of the problem that USENET news servers and caching web proxies face: hundreds of millions of small files that are randomly accessed. You might want to take a hint from them (except that they don't typically ever have to take backups).

http://devel.squid-cache.org/coss/coss-notes.txt

http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4074B50D266E72C69D6D35FEDCBBA83D?doi=10.1.1.31.4000&rep=rep1&type=pdf

Obviously the cyclical nature of the cyclic news filesystem is irrelevant to you, but the lower-level concept of having multiple disk files/devices with packed images, plus a fast index that maps from the information the user provides to the location of the data, is very much applicable.

Dedicated filesystems

Of course, these are just similar concepts to what people were talking about with creating a filesystem in a file and mounting it over loopback, except that there you get to write your own filesystem code. Since you said your system is read-mostly, you could actually dedicate a disk partition (or an LVM partition, for flexibility in sizing) to this one purpose. When you want to back up, mount the filesystem read-only and then make a copy of the partition bits.
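As a rough sketch of that backup step, assuming the dedicated partition is an LVM volume mounted at /srv/covers (both names are placeholders):

    # remount the dedicated partition read-only for a consistent copy
    mount -o remount,ro /srv/covers

    # stream the raw partition to the backup host
    dd if=/dev/vg0/covers bs=64M | gzip | ssh backup-host 'cat > /backup/covers.img.gz'

    # back to read-write once the copy is done
    mount -o remount,rw /srv/covers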

LVM

I mentioned LVM above as being useful to allow dynamic sizing of a partition so that you don't need to back up lots of empty space. But LVM has other features which might be very applicable, specifically the "snapshot" functionality, which lets you freeze a filesystem at a moment in time. An accidental rm -rf or whatever would not disturb the snapshot. Depending on exactly what you are trying to do, that might be sufficient for your backup needs.
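A minimal sketch of a snapshot-based backup might look like this (the VG/LV names and snapshot size are assumptions, not anything from your setup):

    # freeze the filesystem state in a snapshot
    lvcreate --size 10G --snapshot --name covers-snap /dev/vg0/covers

    # copy the frozen image off-host, then drop the snapshot
    dd if=/dev/vg0/covers-snap bs=64M | ssh backup-host 'cat > /backup/covers.img'
    lvremove -f /dev/vg0/covers-snap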

RAID-1

I'm sure you are familiar with RAID already and probably already use it for reliability, but RAID-1 can be used for backups as well, at least if you are using software RAID (you can use it with hardware RAID, but that actually gives you lower reliability because it may require the same model/revision of controller to read the disks). The concept is that you create a RAID-1 group with one more disk than you actually need connected for your normal reliability needs (e.g. a third disk if you use software RAID-1 with two disks, or perhaps a large disk and a hardware RAID-5 of smaller disks with a software RAID-1 on top of the hardware RAID-5). When it comes time to take a backup, install a disk, ask mdadm to add that disk to the RAID group, wait until it indicates completeness, optionally ask for a verification scrub, and then remove the disk. Depending on performance characteristics, you can have the disk installed most of the time and only remove it to exchange it with an alternate disk, or you can have the disk installed only during backups.
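A sketch of that cycle, assuming a software RAID-1 at /dev/md0 that was created with three members but normally runs with two, and /dev/sdc1 as the backup disk (both device names are assumptions):

    # adding the backup disk to the degraded 3-member mirror triggers a rebuild onto it
    mdadm /dev/md0 --add /dev/sdc1

    # watch the resync until it completes
    watch cat /proc/mdstat

    # optionally ask md to verify the mirrors
    echo check > /sys/block/md0/md/sync_action

    # then detach the disk so it can be taken off-site
    mdadm /dev/md0 --fail /dev/sdc1
    mdadm /dev/md0 --remove /dev/sdc1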


You could mount a virtual filesystem image using the loop device. While this would speed up your backup process, it might affect normal operations.

Another alternative is to back up the entire device using dd, for example: dd if=/dev/my_device of=/path/to/backup.dd.


As you probably know, your problem is locality. A typical disk seek takes around 10 ms. So just calling stat() (or open()) on 10 million randomly-placed files requires 10 million seeks, or around 100,000 seconds, which is roughly 28 hours.

So you must put your files into larger containers, such that the relevant number is your drive bandwidth (typically 50-100 MB/s for a single disk) rather than your seek time. That also lets you throw RAID at it, which cranks up the bandwidth (but does not reduce seek time).

I am probably not telling you anything you do not already know, but my point is that your "container" idea will definitely solve the problem, and just about any container will do. Loopback mounts will likely work as well as anything.


There are a couple of options. The simplest, which should work with all Linux filesystems, is to dd-copy the entire partition (/dev/sdb3 or /dev/mapper/Data-ImageVol) to a single image and archive that image. To restore individual files, loopback-mount the image (mount -o loop /usr/path/to/file /mountpoint) and copy out the files you need. For a full partition restore, you can reverse the direction of the initial dd command, but you really do need a partition of identical size.
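A sketch of that workflow, using the Data-ImageVol device mentioned above and placeholder paths (some-cover.jpg is purely hypothetical):

    # back up the whole partition to a single image file
    dd if=/dev/mapper/Data-ImageVol bs=64M of=/backup/imagevol.dd

    # pull individual files back out by loop-mounting the image
    mount -o loop,ro /backup/imagevol.dd /mnt/restore
    cp /mnt/restore/0/0/0/some-cover.jpg /srv/covers/0/0/0/
    umount /mnt/restore

    # full restore: reverse the dd (the target partition must be of identical size)
    dd if=/backup/imagevol.dd bs=64M of=/dev/mapper/Data-ImageVol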

Judging from your use case, I'm guessing individual file restores are a very infrequent event, if they ever occur at all. This is why an image-based backup really makes sense here. If you do need to make individual restores more often, using staged LVM snapshots will be a lot more convenient; but you still need the image-based backup for those critical "we lost everything" disasters. Image-based restores tend to go a lot faster than tar-based restores simply because they just restore blocks: they don't incur the metadata operations that come with every fopen/fclose, and they can also be a highly sequential disk operation for further speed increases.

Alternately, as the Google video @casey pointed to mentions about halfway through, XFS is a great filesystem (if complex). One of the nicer utilities with XFS is xfsdump, which will dump an entire filesystem to a single file, and generally do so faster than tar can. It's a filesystem-specific utility, so it can take advantage of fs internals in ways that tar can't.
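A minimal example, assuming the filesystem is mounted at /srv/covers and the labels and paths are placeholders:

    # level-0 (full) dump of the whole filesystem to one archive file
    xfsdump -l 0 -L covers -M backup1 -f /backup/covers.xfsdump /srv/covers

    # restore it later into an (empty) XFS filesystem
    xfsrestore -f /backup/covers.xfsdump /srv/covers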