How can I verify that a 1TB file transferred correctly?
You can use tee to do the sum on the fly with something like this (adapt the netcat commands for your needs):
Server:
netcat -l -w 2 1111 | tee >( md5sum > /dev/stderr )
Client:
tee >( md5sum > /dev/stderr ) | netcat 127.0.0.1 1111
Nerdwaller's answer about using tee to simultaneously transfer and calculate a checksum is a good approach if you're primarily worried about corruption over the network. It won't protect you against corruption on the way to disk, though, because it takes the checksum before the data hits the disk.
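To also cover the path to disk, one option is to re-read the file on the receiving side after the transfer and compare its digest with the one the sender printed. Here is a minimal sketch of that check; the helper name verify_digest is made up for illustration, and it assumes md5sum is available on the receiver:

```shell
# Hypothetical helper: compare a file's on-disk MD5 with the digest
# reported by the sender (e.g. the one tee/md5sum printed to stderr).
verify_digest() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  -" when reading stdin; keep only the digest.
    actual=$(md5sum < "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Usage on the receiver (digest string copied from the sender's output):
#   verify_digest foo.box <digest from sender>
```

Note that this reads the whole file back, so on a 1 TB file it costs a second full pass over the storage, and a freshly written file may be served from the page cache rather than the disk.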
But I'd like to add something:
1 TiB / 40 minutes ≈ 437 MiB/sec.¹
That's pretty fast, actually. Remember that unless you have a lot of RAM, that data has to come back from storage. So the first thing to do is to watch iostat -kx 10 as you run your checksums; in particular, pay attention to the %util column. If you're pegging the disks (near 100%), then the answer is to buy faster storage.
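The 437 MiB/sec figure is just the file size divided by the time budget. As a quick check, here is a small sketch using awk; the function name rate is made up for illustration:

```shell
# rate BYTES SECONDS UNIT -> required sustained rate, in UNIT bytes per second
rate() {
    awk -v b="$1" -v s="$2" -v u="$3" 'BEGIN { printf "%.0f\n", b / s / u }'
}

rate 1099511627776 2400 1048576   # 1 TiB in 40 minutes, in MiB/sec -> 437
rate 1000000000000 2400 1000000   # 1 TB in 40 minutes, in MB/sec   -> 417
```

Whatever reads the file for checksumming has to sustain that rate from storage, which is why the %util column matters before the choice of hash does.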
Otherwise, as other posters mentioned, you can try different checksum algorithms. MD4, MD5, and SHA-1 are all designed to be cryptographic hashes (though none of those should be used for that purpose anymore; all are considered too weak). Speed-wise, you can compare them with openssl speed md4 md5 sha1 sha256. I've thrown in SHA256 to have at least one hash that is still strong enough.
The 'numbers' are in 1000s of bytes per second processed.
type       16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md4       61716.74k   195224.79k   455472.73k   695089.49k   820035.58k
md5       46317.99k   140508.39k   320853.42k   473215.66k   539563.35k
sha1      43397.21k   126598.91k   283775.15k   392279.04k   473153.54k
sha256    33677.99k    75638.81k   128904.87k   155874.91k   167774.89k
Of the above, you can see that MD4 is the fastest, and SHA256 the slowest. This result is typical on PC-like hardware, at least.
If you want even more performance (at the cost of being trivial to tamper with, and also less likely to detect corruption), you want to look at a CRC or Adler hash. Of the two, Adler is typically faster, but weaker. Unfortunately, I'm not aware of any really fast command line implementations; the programs on my system are all slower than OpenSSL's md4.
So, your best bet speed-wise is openssl md4 -r (the -r makes the output look like md5sum output).
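As an illustration of the -r format, here is a small sketch wrapping openssl dgst. The wrapper name ossl_digest is made up; the digest name is passed as a parameter because, as far as I know, newer OpenSSL releases (3.x) treat md4 as a legacy algorithm, so it may not be enabled by default on your system:

```shell
# Hypothetical wrapper: print "<digest> *<file>" in md5sum-like layout.
# $1 = digest name understood by your openssl build (md4, md5, sha1, ...)
# $2 = file to hash
ossl_digest() {
    openssl dgst -"$1" -r "$2"
}

# Usage:
#   ossl_digest md4 foo.box
#   ossl_digest md5 foo.box    # digest field should match: md5sum foo.box
```

Whichever digest you pick, use the same one on both ends of the transfer.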
If you're willing to do some compiling and/or minimal programming, see Mark Adler's code over on Stack Overflow and also xxhash. If you have SSE 4.2, you will not be able to beat the speed of the hardware CRC instruction.
¹ 1 TiB = 1024⁴ bytes; 1 MiB = 1024² bytes. With powers-of-1000 units (1 TB in 40 minutes), it comes to ≈ 417 MB/sec.
The openssl command supports several message digests. Of the ones I was able to try, md4 seems to run in about 65% of the time of md5, and about 54% of the time of sha1 (for the one file I tested with).
There's also an md2 in the documentation, but it seems to give the same results as md5.
Very roughly, speed seems to be inversely related to quality, but since you're (probably) not concerned about an adversary creating a deliberate collision, that shouldn't be much of an issue.
You might look around for older and simpler message digests (was there an md1, for example?).
A minor point: You've got a Useless Use of cat. Rather than:
cat foo.box | nc <archive IP> 1234
you can use:
nc <archive IP> 1234 < foo.box
or even:
< foo.box nc <archive IP> 1234
Doing so saves a process, but probably won't have any significant effect on performance.
Two options:
Use sha1sum
sha1sum foo.box
In some circumstances sha1sum is faster.
Use rsync
It will take longer to transfer, but rsync verifies that the file arrived intact.
From the rsync man page:
Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred...
Science is progressing. It appears that the new BLAKE2 hash function is faster than MD5 (and cryptographically much stronger to boot).
Reference: https://leastauthority.com/blog/BLAKE2-harder-better-faster-stronger-than-MD5.html
From Zooko's slides:
Cycles per byte on Intel Core i5-3210M (Ivy Bridge):
function   long msg   4096 B   64 B
MD5             5.0      5.2   13.1
SHA1            4.7      4.8   13.7
SHA256         12.8     13.0   30.0
Keccak          8.2      8.5   26.0
BLAKE1          5.8      6.0   14.9
BLAKE2          3.5      3.5    9.3
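To relate cycles per byte to the throughput figures discussed earlier in the thread, divide the clock rate by the cycles-per-byte value. The sketch below does that with awk; the function name cpb_to_mbs is made up, and the 2.5 GHz clock is an assumption (the nominal base clock of that CPU, ignoring turbo), so treat the results as rough single-core estimates:

```shell
# cpb_to_mbs CYCLES_PER_BYTE CLOCK_GHZ -> approximate MB/sec (powers of 1000)
cpb_to_mbs() {
    awk -v cpb="$1" -v ghz="$2" 'BEGIN { printf "%.0f\n", ghz * 1000 / cpb }'
}

cpb_to_mbs 3.5 2.5   # BLAKE2, long messages -> 714
cpb_to_mbs 5.0 2.5   # MD5, long messages    -> 500
```

Under that assumption, BLAKE2 would leave more headroom than MD5 for a transfer that needs roughly 417 MB/sec, provided storage can keep up.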