How to copy a large number (> 1 million) of small files between two servers

I'd be tempted to answer "stop abusing the file system by treating it like a database" but I'm sure it wouldn't help you much ;)

First, you need to understand that if your limitation is the bandwidth available on read, there isn't anything you can do to improve performance with a simple sync command. In that case, you'll have to split the data when it's written, either by changing the way the files are created (which means, as you guessed correctly, asking the devs to change the source program) or by using a product that does geo-mirroring (for instance Double-Take: look around, as I'm sure you'll find alternatives; that's just an example).

In cases like this, the main problem typically isn't the file data itself but rather metadata access. Your first strategy should therefore be to divide the load into multiple processes that act on (completely) different directories: that should help the file system keep up with supplying the metadata you need.
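As a rough illustration of that strategy, here is a minimal sketch that runs one rsync process per top-level directory so the metadata lookups are spread across several workers. The source path, destination host and worker count are assumptions for the example, not anything from your setup:

```python
#!/usr/bin/env python3
"""Run one rsync per top-level directory so metadata access is parallelised.

Paths, host name and worker count are illustrative assumptions only.
"""
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SOURCE_ROOT = Path("/srv/data")     # hypothetical source tree
DEST = "backup-host:/srv/data"      # hypothetical destination (rsync-over-ssh)
WORKERS = 4                         # tune to what the disks and CPU can take


def sync_dir(directory: Path) -> int:
    """Copy a single top-level directory with its own rsync process."""
    cmd = [
        "rsync", "-a",
        str(directory),             # no trailing slash: recreate the dir under DEST
        DEST + "/",
    ]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    top_dirs = [p for p in SOURCE_ROOT.iterdir() if p.is_dir()]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(sync_dir, top_dirs))
    failed = sum(1 for rc in results if rc != 0)
    print(f"{len(top_dirs) - failed} directories synced, {failed} failed")
```

Keep the worker count modest: past a point, more concurrent rsyncs just make the disks seek harder and throughput drops again.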

Another strategy is to use your backup system for this: replay your last incremental backups onto the target to keep the copy in sync.
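If your backup product can't do that directly, GNU tar's incremental mode is one crude way to approximate the same idea; the snapshot file and paths below are assumptions for the sketch, not part of your environment:

```python
#!/usr/bin/env python3
"""Approximate "replay incremental backups on the target" with GNU tar.

Run create_increment() on the source host, ship the archive to the target,
then run apply_increment() there.  All paths are illustrative.
"""
import subprocess

SNAPSHOT = "/var/backups/files.snar"    # GNU tar state file (assumed path)
SOURCE_DIR = "/srv/data"                # hypothetical data tree
ARCHIVE = "/var/backups/increment.tar"  # archive to ship to the target


def create_increment() -> None:
    """On the source: archive only what changed since the last run."""
    subprocess.run(
        ["tar", "--create",
         "--listed-incremental=" + SNAPSHOT,
         "--file=" + ARCHIVE,
         "-C", SOURCE_DIR, "."],
        check=True,
    )


def apply_increment(archive: str, target_dir: str = "/srv/data") -> None:
    """On the target: replay the increment on top of the existing copy."""
    subprocess.run(
        ["tar", "--extract",
         "--listed-incremental=/dev/null",  # ignore snapshot state on restore
         "--file=" + archive,
         "-C", target_dir],
        check=True,
    )
```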

Finally, there are more exotic strategies that can be applied in specific cases. For instance, I solved a similar problem on a Windows site by writing a program that swept the files into a database every few minutes, thus keeping the file system clean.


I don't think anything has changed. If you can quiesce the data on the source system, I think some variant of tar will be the fastest. If not, rsync is still the next best way; make sure to use the --whole-file switch and a less CPU-intensive SSH cipher (e.g. arcfour). Do you have any option to perform a block-level copy? You mention iSCSI storage. Will the new system have iSCSI-attached storage as well?
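To make both options concrete, here is a hedged sketch of each: a one-shot tar stream over ssh for the quiesced case, and an rsync with --whole-file plus a cheap SSH cipher otherwise. The host and paths are assumptions, and note that arcfour has been removed from recent OpenSSH builds, so substitute a cipher your installation actually offers if needed:

```python
#!/usr/bin/env python3
"""Sketches of the two copy options above; host and paths are assumptions."""
import subprocess

SRC = "/srv/data"          # hypothetical source tree
DEST_HOST = "new-server"   # hypothetical target host
DEST = "/srv/data"         # path on the target


def tar_stream() -> None:
    """Quiesced data: tar on the source piped straight into tar on the target."""
    pipeline = (
        f"tar -cf - -C {SRC} . | "
        f"ssh {DEST_HOST} 'tar -xf - -C {DEST}'"
    )
    subprocess.run(pipeline, shell=True, check=True)


def rsync_whole_file() -> None:
    """Live data: repeatable sync, skip the delta algorithm, use a light cipher."""
    subprocess.run(
        ["rsync", "-a", "--whole-file",
         "-e", "ssh -c arcfour",      # swap the cipher if your sshd rejects it
         SRC + "/",                   # trailing slash: copy the contents of SRC
         f"{DEST_HOST}:{DEST}/"],
        check=True,
    )
```

The tar stream avoids the per-file round trips entirely, which is why it wins when the source can be frozen; --whole-file helps rsync because with millions of small files the delta algorithm costs more CPU than it saves in transfer.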