For large files, compress first then transfer, or rsync -z? Which would be fastest?
I have a ton of relatively small data files, but they take up about 50 GB and I need them transferred to a different machine. I was trying to think of the most efficient way to do this.
The thoughts I had were to gzip the whole thing, rsync it, and decompress it; rely on rsync -z for compression; or gzip first and then also use rsync -z. I am not sure which would be most efficient, since I am not sure exactly how rsync -z is implemented. Any ideas on which option would be fastest?
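To make that concrete, the three options I'm weighing look roughly like this (paths and hostname are placeholders):

# Option 1: tar + gzip everything, rsync the archive, decompress on the other side
tar -czf data.tar.gz /path/to/data
rsync data.tar.gz otherhost:/path/to/dest/

# Option 2: rsync the files as-is and let -z compress them on the wire
rsync -az /path/to/data/ otherhost:/path/to/dest/

# Option 3: gzip the individual files in place first, then still use rsync -z
gzip -r /path/to/data
rsync -az /path/to/data/ otherhost:/path/to/dest/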
You can't "gzip the whole thing", as gzip only compresses a single file. You could create a tar file and gzip that to "gzip the whole thing", but you would lose rsync's ability to copy only the modified files.
So the question is: is it better to store the files you need to rsync gzipped, or to rely on the -z option of rsync?

Presumably you want to keep the files unzipped on your server? I guess yes, so I don't see how you could gzip the files before doing the rsync.

Maybe you don't need rsync's ability to copy only the modified files? In that case, why use rsync at all instead of just doing an scp of a tar.gz containing your stuff?
Anyway, to answer the question: rsync's gzip will be a little less efficient than gzipping the files with gzip. Why? Because rsync gzips the data chunk by chunk, so a smaller set of data is used to build the table gzip uses for compression, and a bigger set of data (gzip would use the whole file at once) gives a better compression table. In most cases the difference will be very, very small, but in rare cases it can be more significant (for example, a very large file with a very long pattern repeated many times but far apart from each other; this is a very simplified example).
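A rough way to see the effect yourself is to compare whole-file compression against independently compressed chunks (bigfile is a placeholder; the 64 KiB chunk size is arbitrary and is not what rsync actually uses):

# Whole file compressed in one go
gzip -c bigfile | wc -c

# Same file compressed in independent 64 KiB chunks; the total is usually a bit larger
split -b 64k bigfile chunk.
for f in chunk.*; do gzip -c "$f"; done | wc -c
rm chunk.*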
@radius, a minor nit to pick about how gzip works - gzip is a block-based compression algorithm, and a fairly simple one at that. The whole file is not considered for the compression table - only each block. Other algorithms may use the whole contents of the file, and there are a few that use the contents of multiple blocks or even variably-sized blocks. One fascinating example is lrzip, which builds on rzip by the same author as rsync!

The skinny on gzip's algorithm.
So, in summary, using rsync -z will likely yield the same compression as gzipping first - and if you're doing a differential transfer, better, because of rsync's diffing algorithm.
That said, I think one will find that regular scp handily beats rsync for non-differential transfers - because it will have far less overhead than rsync's algorithm (which runs over ssh under the hood anyway!).
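For a one-shot copy, that would be something like this (paths and hostname are placeholders):

scp -r /home/me/source/directory target:/home/you/target/directory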
If your network does become a bottleneck, then you would want to use compression on the wire.
If your disks are the bottleneck, that's when streaming into a compressed file would be best (for example, netcat from one machine to the next, streaming into gzip -c).
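A sketch of that kind of pipeline, with the receiving side compressing the stream as it is written to disk (hostname, path, and port are placeholders; netcat option syntax varies between implementations, some need nc -l -p 9999):

# On the receiving machine: listen, compress the stream as it hits the disk
nc -l 9999 | gzip -c > data.tar.gz

# On the sending machine: stream the files out uncompressed
tar -cf - /path/to/data | nc receiver-host 9999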
Usually, if speed is key, compressing an existing file before-hand is wasteful.
TIMTOWTDI, YMMV, IANAL, etc.
If you're only copying the data once, rsync isn't going to be a big win in and of itself. If you like gzip (or tar + gzip, since you have many files), you might try something like:
tar -czf - /home/me/source/directory | ssh target tar -xzf - --directory /home/you/target/directory
That would get the compression you are looking for and just copy directly without involving rsync.
According to this guy it may just be faster to use rsync -z, although I would guess it would be close to as efficient as compressing each file first before transferring. It should be faster than compressing the tar stream, as suggested by others.
From the man page:
Note that this option typically achieves better compression ratios than can be achieved by using a compressing remote shell or a compressing transport because it takes advantage of the implicit information in the matching data blocks that are not explicitly sent over the connection.
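In practice that's just something like the following (paths and hostname are placeholders; --compress-level is optional if you want to trade CPU time for a better ratio):

rsync -avz --compress-level=9 /path/to/data/ otherhost:/path/to/dest/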