Best way to copy millions of files between 2 servers

Solution 1:

Something like this should work well:

tar cf - some/dir | gzip | ssh host2 'tar xzf -'

Since you are on a gigabit network, you can probably also omit gzip and the "z" flag for extraction; at that speed compression is likely to cost more time than it saves.
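As a rough sketch of the uncompressed variant, the two ends of the pipe can be wrapped in small shell functions. The names send_tree and receive_tree are made up for illustration, and host2 and the paths are placeholders.

```shell
# Write the contents of directory $1 to stdout as one uncompressed tar stream.
send_tree() {
    tar -C "$1" -cf - .
}

# Read a tar stream from stdin and unpack it into directory $1,
# creating the directory first if it does not exist.
receive_tree() {
    mkdir -p "$1" && tar -C "$1" -xf -
}

# Over the network the receiving half runs on the remote side, e.g.:
#   send_tree some/dir | ssh host2 'mkdir -p /target/dir && tar -C /target/dir -xf -'
```

Piping send_tree into receive_tree on one machine is an easy way to try it out before pointing it at the real data.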

Solution 2:

I'm sure the fact that you have all FIVE MILLION files in a single directory will throw many tools into a tizzy. I'm not surprised that rsync didn't handle this gracefully - it's quite a "unique" situation. If you could figure out a way to structure the files into some sort of directory structure, I'm sure the standard sync tools such as rsync would be much more responsive.

However, just to give some actual advice - perhaps one solution would be to move the drive physically into the destination machine temporarily so you can do a copy of the files in the actual server (not over the network). Then, move the drive back and use rsync to keep things up to date.
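For the restructuring idea, here is a minimal sketch that spreads a flat directory into subdirectories named after the first two characters of each file name. The function name shard_by_prefix and the two-character scheme are assumptions for illustration; a date-based or hash-based layout may suit your data better.

```shell
# usage: shard_by_prefix SRC_DIR DEST_DIR
# Moves every regular file directly under SRC_DIR into
# DEST_DIR/<first two characters of the file name>/.
# Names shorter than two characters and dotfiles go to DEST_DIR/_/.
# Assumes file names contain no newline characters.
shard_by_prefix() {
    src=$1
    dest=$2
    find "$src" -maxdepth 1 -type f | while IFS= read -r f; do
        name=${f##*/}              # strip the directory part
        rest=${name#??}            # name without its first two characters
        prefix=${name%"$rest"}     # what remains is the two-character prefix
        case $prefix in
            ''|.*) prefix=_ ;;     # too short, or starts with a dot
        esac
        mkdir -p "$dest/$prefix" && mv -- "$f" "$dest/$prefix/"
    done
}
```

It uses only shell parameter expansion per file (no basename or cut), which matters when the loop runs millions of times. DEST_DIR should be a different directory from SRC_DIR so a shard directory can never collide with an existing file name.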

Solution 3:

To copy millions of files over a gigabit switch (in a trusted environment) you may also use a combination of netcat (or nc) and tar, as already suggested by user55286. This will stream all the files as one large file (see Fast File Copy - Linux! (39 GBs)).

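# requires netcat on both servers; start the listener on the destination first
# (traditional netcat syntax; OpenBSD netcat uses "nc -l 2342" instead)
nc -l -p 2342 | tar -C /target/dir -xzf -     # destination box
tar -czf - /source/dir | nc Target_Box 2342   # source box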

Solution 4:

We had about 1 million files in a directory (about 4 years' worth of files).

We used robocopy to move the files into YYYY/MM directories (about 35,000-45,000 files per month). We put the robocopy commands in a .bat file like this:

ROBOCOPY /NS /NC /NFL /NP /LOG+:H:\BCK_REPORT\ROBO.LOG /MAXAGE:20081101 /MINAGE:20081201 /MOV H:\Cs\out\fix H:\BCK_REPORT\2008\11
ROBOCOPY /NS /NC /NFL /NP /LOG+:H:\BCK_REPORT\ROBO.LOG /MAXAGE:20081201 /MINAGE:20090101 /MOV H:\Cs\out\fix H:\BCK_REPORT\2008\12
ROBOCOPY /NS /NC /NFL /NP /LOG+:H:\BCK_REPORT\ROBO.LOG /MAXAGE:20090101 /MINAGE:20090201 /MOV H:\Cs\out\fix H:\BCK_REPORT\2009\01
ROBOCOPY /NS /NC /NFL /NP /LOG+:H:\BCK_REPORT\ROBO.LOG /MAXAGE:20090201 /MINAGE:20090301 /MOV H:\Cs\out\fix H:\BCK_REPORT\2009\02

Brief notes: /NS /NC /NFL /NP keep the log file from being bloated with per-file information. /LOG+:... appends the summary information to the log file.

/MAXAGE and /MINAGE select files modified within a date range: /MAXAGE excludes files older than the given date and /MINAGE excludes files newer than it.

So, for example, this line moves files modified on or after 01/Nov/2008 (inclusive) and before 01/Dec/2008 (not inclusive):

ROBOCOPY /NS /NC /NFL /NP /LOG+:H:\BCK_REPORT\ROBO.LOG /MAXAGE:20081101 /MINAGE:20081201 /MOV H:\Cs\out\fix H:\BCK_REPORT\2008\11

/MOV moves the files (they are deleted from the source after copying).

Then comes the source directory.

Then comes the destination directory (directories are created on the fly as and when required).
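The monthly lines follow a fixed pattern: /MAXAGE is the first day of the month, /MINAGE is the first day of the following month, and the destination ends in YYYY\MM. As a sketch only, a small POSIX shell function can print such lines for pasting into the .bat file. The name gen_robocopy is made up, and the paths are copied from the example above.

```shell
# usage: gen_robocopy START_YEAR START_MONTH COUNT
# Prints COUNT consecutive monthly ROBOCOPY lines, starting at the given month.
# Pass the month without a leading zero (8, not 08), since the shell would
# otherwise try to read it as an octal number.
gen_robocopy() {
    y=$1 m=$2 n=$3
    while [ "$n" -gt 0 ]; do
        # first day of the following month, rolling over at year end
        ny=$y nm=$((m + 1))
        if [ "$nm" -gt 12 ]; then nm=1 ny=$((y + 1)); fi
        printf 'ROBOCOPY /NS /NC /NFL /NP /LOG+:H:\\BCK_REPORT\\ROBO.LOG /MAXAGE:%04d%02d01 /MINAGE:%04d%02d01 /MOV H:\\Cs\\out\\fix H:\\BCK_REPORT\\%04d\\%02d\n' \
            "$y" "$m" "$ny" "$nm" "$y" "$m"
        y=$ny m=$nm n=$((n - 1))
    done
}
```

For example, `gen_robocopy 2008 11 4` prints the four lines of the .bat file shown earlier.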

It took about 40-60 minutes to transfer one month's worth of files (about 35,000-45,000 files). We reckon it takes about 12 hours or less to transfer one year's worth.

Using Windows Server 2003.

Everything is logged in the log file: start time, end time, and number of files copied.

Robocopy saved the day.

Solution 5:

You know, I plus-1'd the tar solution, but -- depending on the environment -- there's one other idea that occurs. You might think about using dd(1). The speed issue with something like this is that it takes many head motions to open and close a file, which you'll be doing five million times. If you could ensure that the files are allocated contiguously, you could dd them instead, which would cut the number of head motions by a factor of 5 or more.
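One way to read this suggestion is to copy the whole partition or disk image as a raw byte stream instead of file by file. The sketch below assumes that reading; the function names send_image and recv_image and the device names in the comment are hypothetical. It only makes sense if the source filesystem is unmounted or mounted read-only during the copy and the destination is at least as large as the source.

```shell
# Stream the raw bytes of a file or block device $1 to stdout in 1 MiB blocks.
send_image() {
    dd if="$1" bs=1048576 2>/dev/null
}

# Write the raw bytes arriving on stdin to a file or block device $1.
recv_image() {
    dd of="$1" bs=1048576 2>/dev/null
}

# Over the network (device names are placeholders, double-check them first,
# because the destination device is overwritten completely):
#   send_image /dev/sdX1 | ssh host2 'dd of=/dev/sdY1 bs=1048576'
```

Trying it on an ordinary file first is a cheap way to check the pipeline before touching real devices.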