Copy 10 million images in a single folder to another server

Now, I know you shouldn't ever put 10 million files into a single directory to begin with. Blame it on the developers, but as it stands that's where I am at. We will be fixing it and moving them into folder groups, but first we have to get them copied off of the production box.

I first tried rsync, but it failed miserably. I assume the memory needed to store the file names and paths exceeded the available RAM and swap space.

Then I tried to compress everything into a tar.gz, but it couldn't be extracted: I got a "file too large" error (the archive was about 60 GB).

I then tried a straight tar-to-tar extraction, but got a "cannot open: file too large" error:

tar c images/ | tar x -C /mnt/coverimages/

Extra Info:

/mnt/coverimages/ is the NFS share we want to move the images to.

All files are images

OS: Gentoo


Solution 1:

If you install version 3+ of rsync, it builds the file list incrementally as it transfers, so it won't need to keep the entire file list in memory. In the future you probably want to consider hashing the filenames and creating a directory structure based on parts of those hashes.
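As a minimal sketch, once rsync 3+ is installed on both ends, the transfer itself is just the usual invocation (check the version first; the trailing slash on images/ copies the directory's contents rather than the directory itself):

rsync --version

rsync -av images/ /mnt/coverimages/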

You can see this answer to get an idea of what I mean with the hashing.
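As a rough illustration only (the two-level split and the use of md5sum are my assumptions, not necessarily the scheme in the linked answer), a bash snippet like this derives a directory path from a hash of each filename:

f=example.jpg
h=$(printf '%s' "$f" | md5sum | cut -c1-4)
mkdir -p "images/${h:0:2}/${h:2:2}"
mv "$f" "images/${h:0:2}/${h:2:2}/$f"

That would place the file at something like images/ab/cd/example.jpg, keeping each directory down to a manageable number of entries.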

Solution 2:

If I could arrange the downtime, I'd simply move the disk over temporarily.

Solution 3:

Have you tried using find and -exec (or xargs), something like

find images/ -type f -exec cp "{}" /mnt/coverimages/ \;

?
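As a variation on the same idea (a sketch, assuming GNU find and GNU cp for -print0, -0 and -t), batching the files through xargs avoids forking one cp process per image:

find images/ -type f -print0 | xargs -0 cp -t /mnt/coverimages/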

Solution 4:

I don't think you have the "tar | tar" command quite right. Try this:

tar cf - images/ | (cd /mnt/coverimages && tar xf -)

Another option would be to stream over SSH (some CPU overhead for encryption):

tar cf - images/ | ssh user@desthost "cd /path/coverimages && tar xf -"

There's also cpio, which is a bit more obscure but offers similar functionality:

find images/ | cpio -pdm /mnt/coverimages/
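Here -p is cpio's pass-through (copy) mode, -d creates leading directories as needed, and -m preserves the files' modification times.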