Moving 2TB (10 mil files + dirs), what's my bottleneck?

Solution 1:

Ever heard of splitting large tasks into smaller tasks?

/home/data/repo contains 1M dirs, each of which contains 11 dirs and 10 files. It totals 2TB.

rsync -a /source/1/ /destination/1/
rsync -a /source/2/ /destination/2/
rsync -a /source/3/ /destination/3/
rsync -a /source/4/ /destination/4/
rsync -a /source/5/ /destination/5/
rsync -a /source/6/ /destination/6/
rsync -a /source/7/ /destination/7/
rsync -a /source/8/ /destination/8/
rsync -a /source/9/ /destination/9/
rsync -a /source/10/ /destination/10/
rsync -a /source/11/ /destination/11/

(...)

Coffee break time.
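If typing out a line per top-level directory loses its charm, a shell loop does the same thing. This is a rough sketch assuming the numbered /source/N layout above; if the disks can keep up, you could also run a few of these at once with GNU parallel:

# One rsync per top-level directory, so each run only has a small file list to build.
for d in /source/*/; do
    rsync -a "$d" /destination/"$(basename "$d")"/
done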

Solution 2:

This is what is happening:

  • Initially rsync will build the list of files.
  • Building this list is really slow, due to an initial sorting of the file list.
  • This can be avoided by using ls -f -1 and combining it with xargs to build the set of files that rsync will use, or by redirecting the output to a file containing the file list.
  • Passing this list to rsync instead of the folder will make rsync start working immediately (see the sketch after this list).
  • This trick of ls -f -1 over folders with millions of files is perfectly described in this article: http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
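A rough sketch of that approach (paths like /tmp/filelist are placeholders, not anything from the original question):

# Build an unsorted listing of the top-level entries (ls -f skips the sort that
# makes plain ls crawl on huge directories), dropping the "." and ".." entries.
cd /source
ls -f -1 | grep -vE '^\.\.?$' > /tmp/filelist

# Feed the list to rsync. --files-from turns off the recursion normally implied
# by -a, so -r is added back to descend into each listed directory.
rsync -a -r --files-from=/tmp/filelist /source/ /destination/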

Solution 3:

Even if rsync is slow (why is it slow? maybe -z will help), it sounds like you've gotten a lot of it moved over, so you could just keep trying.

If you used --remove-source-files, you could then follow up by removing the empty directories. --remove-source-files will remove all the files, but will leave the directories there (see the find sketch below).
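For that cleanup, something along these lines should work (the path is a placeholder):

# After --remove-source-files, only empty directories remain on the source side.
# -depth removes children before their parents; -empty limits this to empty dirs.
find /source -depth -type d -empty -delete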

Just make sure you DO NOT use --remove-source-files with --delete to do multiple passes.

Also, for increased speed, you can use --inplace.
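As a rough sketch (paths are placeholders, not the asker's actual layout), a resume pass combining those flags could look like:

# Keep re-running this until nothing is left to transfer:
# --inplace writes updated files directly instead of building temporary copies,
# --remove-source-files deletes each source file once it has been copied over.
rsync -a --inplace --remove-source-files /source/ /destination/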

If you're getting kicked out because you're trying to do this remotely on a server, go ahead and run this inside a 'screen' session. At least that way you can let it run.
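A minimal screen workflow, assuming screen is available on the server (the session name here is just an example):

screen -S bigmove                    # start a named session
rsync -a /source/ /destination/      # run the copy inside the session
# detach with Ctrl-a d, log out if you like, then reattach later with:
screen -r bigmove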