For large files, compress first then transfer, or rsync -z? Which would be fastest?
I have a ton of relatively small data files, but they take up about 50 GB and I need them transferred to a different machine. I was trying to think of the most efficient way to do this.
The thoughts I had were to gzip the whole thing, rsync it, and decompress it; rely on rsync -z for compression; or gzip first and then also use rsync -z. I am not sure which would be most efficient, since I am not sure exactly how rsync -z is implemented. Any ideas on which option would be fastest?
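To make that concrete, the three options I'm weighing look roughly like this (paths and hostname are placeholders):

# Option 1: tar + gzip everything, rsync the archive, decompress on the other side
tar -czf data.tar.gz /path/to/data
rsync data.tar.gz otherhost:/path/to/dest/

# Option 2: rsync the files as-is and let -z compress them on the wire
rsync -az /path/to/data/ otherhost:/path/to/dest/

# Option 3: gzip the individual files in place first, then still use rsync -z
gzip -r /path/to/data
rsync -az /path/to/data/ otherhost:/path/to/dest/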
You can't "gzip the whole thing", as gzip only compresses a single file. You could create a tar file and gzip that to "gzip the whole thing", but you would lose rsync's ability to copy only the modified files.
So the question is: is it better to store the files you need to rsync gzipped, or to rely on the -z option of rsync?

Presumably you want to keep the files unzipped on your server? I guess yes, so I don't see how you could gzip the files before doing the rsync.

Maybe you don't need rsync's ability to copy only the modified files? In that case, why use rsync at all instead of just doing an scp of a tar.gz containing your stuff?
Anyway, to answer the question: rsync's gzip will be a little less efficient than gzipping the files with gzip. Why? Because rsync gzips the data chunk by chunk, so a smaller set of data is used to build the table gzip uses for compression, and a bigger set of data (gzip would use the whole file at once) gives a better compression table. In most cases the difference will be very, very small, but in rare cases it can be more significant (for example, a very large file with a very long pattern repeated many times but far apart from each other; this is a very simplified example).
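A rough way to see the effect yourself is to compare whole-file compression against independently compressed chunks (bigfile is a placeholder; the 64 KiB chunk size is arbitrary and is not what rsync actually uses):

# Whole file compressed in one go
gzip -c bigfile | wc -c

# Same file compressed in independent 64 KiB chunks; the total is usually a bit larger
split -b 64k bigfile chunk.
for f in chunk.*; do gzip -c "$f"; done | wc -c
rm chunk.*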
@radius, a minor nit to pick about how gzip works - gzip is a block-based compression algorithm, and a fairly simple one at that. The whole file is not considered for the compression table - only each block. Other algorithms may use the whole contents of the file, and there are a few that use the contents of multiple blocks or even variably-sized blocks. One fascinating example is lrzip, which builds on rzip by the same author as rsync!

The skinny on gzip's algorithm.
So, in summary, using rsync -z will likely yield the same compression as gzipping first - and if you're doing a differential transfer, better, because of rsync's diffing algorithm.
That said, I think one will find that regular scp handily beats rsync for non-differential transfers - because it will have far less overhead than rsync's algorithm (which runs over ssh under the hood anyway!).
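For a one-shot copy, that would be something like this (paths and hostname are placeholders):

scp -r /home/me/source/directory target:/home/you/target/directory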
If your network does become a bottleneck, then you would want to use compression on the wire.
If your disks are the bottleneck, that's when streaming into a compressed file would be best (for example, netcat from one machine to the next, streaming into gzip -c).
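A sketch of that kind of pipeline, with the receiving side compressing the stream as it is written to disk (hostname, path, and port are placeholders; netcat option syntax varies between implementations, some need nc -l -p 9999):

# On the receiving machine: listen, compress the stream as it hits the disk
nc -l 9999 | gzip -c > data.tar.gz

# On the sending machine: stream the files out uncompressed
tar -cf - /path/to/data | nc receiver-host 9999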
Usually, if speed is key, compressing an existing file before-hand is wasteful.
TIMTOWTDI, YMMV, IANAL, etc.
If you're only copying the data once, rsync isn't going to be a big win in and of itself. If you like gzip (or tar + gzip, since you have many files), you might try something like:
tar -czf - /home/me/source/directory | ssh target tar -xzf - --directory /home/you/target/directory
That would get the compression you are looking for and just copy directly without involving rsync.
According to this guy it may just be faster to use rsync -z, although I would guess it would be close to as efficient as compressing each file first before transferring. It should be faster than compressing the tar stream, as suggested by others.
From the man page:
Note that this option typically achieves better compression ratios than can be achieved by using a compressing remote shell or a compressing transport because it takes advantage of the implicit information in the matching data blocks that are not explicitly sent over the connection.
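In practice that's just something like the following (paths and hostname are placeholders; --compress-level is optional if you want to trade CPU time for a better ratio):

rsync -avz --compress-level=9 /path/to/data/ otherhost:/path/to/dest/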