Rsync a huge dataset of small files (5 TB, 1M+ files)
Try xargs+rsync:
find . -type f -print0 | xargs -J % -0 rsync -aP % user@host:some/dir/
You can control how many files are passed to each rsync invocation with xargs' -n option.
E.g., to copy 200 files per rsync call:
find . -type f -print0 | xargs -n 200 -J % -0 rsync -aP % user@host:some/dir/
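Note that -J is specific to BSD xargs (macOS and the BSDs). GNU xargs has no -J, but you can get the same effect by going through sh -c so the remote destination still comes last; a rough equivalent (rsync-batch is just a placeholder name for $0):
find . -type f -print0 | xargs -0 -n 200 sh -c 'rsync -aP "$@" user@host:some/dir/' rsync-batch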
If it's too slow, you can run multiple rsync processes in parallel with xargs' -P option:
find . -type f -print0 | xargs -P 8 -n 200 -J % -0 rsync -aP % user@host:some/dir/
This runs up to 8 rsync processes in parallel.
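One caveat worth checking: rsync is handed bare file paths here, so it copies only the files themselves and everything lands flat in some/dir/, with identically named files overwriting each other. If you need the directory layout reproduced, rsync's -R (--relative) keeps the paths exactly as find emits them, e.g.:
find . -type f -print0 | xargs -P 8 -n 200 -J % -0 rsync -aRP % user@host:some/dir/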
If this is a trusted/secure network and you can open a port on the target host, a good way to reproduce a tree on another machine is the combination of tar and netcat. I'm not at a terminal so I can't write a full demonstration, but this page does a pretty good job:
http://toast.djw.org.uk/tarpipe.html
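In rough outline the pipe looks like this (a minimal sketch: target-host and port 7000 are placeholders, and nc's flag syntax varies a bit between netcat implementations). On the target host, start a listener that unpacks whatever arrives:
nc -l 7000 | tar xpf -
Then, from the top of the tree on the source host:
tar cf - . | nc target-host 7000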
Definitely use compression. In the best case you can transfer the data at the rate permitted by the slowest of the three potential bottlenecks: reads on the source, the network, and writes on the target.
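With the rsync approach that just means adding -z to the rsync flags; with the tar pipe you can slot a compressor into the stream on both ends, e.g. (gzip used purely as an example, any streaming compressor works the same way):
tar cf - . | gzip | nc target-host 7000
nc -l 7000 | gunzip | tar xpf -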