Can I validate a large file download piecemeal over HTTP?

Solution 1:

On the server side, you can use dd and md5sum to checksum each chunk of the file:

#!/bin/bash
FILENAME="$1"
FILESIZE=$(stat --printf="%s" "$FILENAME")
CHUNKSIZE=536870912 # 512 MiB
CHUNKNUM=0
# Hash one chunk per iteration until the whole file has been covered.
while [ $(( CHUNKNUM * CHUNKSIZE )) -lt "$FILESIZE" ]; do
    dd if="$FILENAME" bs=$CHUNKSIZE skip=$CHUNKNUM count=1 2> /dev/null | md5sum >> "$FILENAME.md5"
    CHUNKNUM=$(( CHUNKNUM + 1 ))
done
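
For example, assuming the script is saved on the server as chunkhash.sh (a name chosen here just for illustration) next to the file:

chmod +x chunkhash.sh
./chunkhash.sh somelargetarfile.tar   # produces somelargetarfile.tar.md5 in the current directory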

You will be left with a single $FILENAME.md5 file containing one hash per chunk, in order (md5sum prints - as the file name because it reads from a pipe).

You can now download that large file together with the checksums, run the same script on the downloaded file, and compare the hashes. If any chunk gets a mismatched hash, you can use curl to download only that part of the file (if the server supports Range requests) and patch the file in place with dd.
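
A minimal sketch of the comparison step, assuming the server's hashes were saved as remote.md5 and the script above was run locally to produce somelargetarfile.tar.md5 (both file names are placeholders):

# Print the 0-based number of every chunk whose hash differs;
# the number matches CHUNKNUM in the script above.
paste remote.md5 somelargetarfile.tar.md5 | awk '$1 != $3 { print "chunk", NR-1, "is corrupt" }'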

For example, if the second chunk (CHUNKNUM=1, covering bytes 536870912-1073741823) gets a mismatched hash:

curl -s -r 536870912-1073741823 "$URL" | dd of=somelargetarfile.tar bs=536870912 seek=1 conv=notrunc

This downloads only that chunk (here $URL stands for the file's download URL) and patches it into the large tar file in place. Note that dd's seek is counted in blocks of bs bytes, so bs=536870912 seek=1 starts writing at byte 536870912, and conv=notrunc keeps the rest of the file intact.
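
A sketch generalizing that patch step to any chunk number (variable names are placeholders; CHUNKNUM uses the same 0-based numbering as the hashing script):

URL="http://example.com/somelargetarfile.tar"   # placeholder for the real download URL
CHUNKSIZE=536870912
CHUNKNUM=1                                      # the chunk reported as corrupt
START=$(( CHUNKNUM * CHUNKSIZE ))
END=$(( START + CHUNKSIZE - 1 ))
curl -s -r "$START-$END" "$URL" | dd of=somelargetarfile.tar bs=$CHUNKSIZE seek=$CHUNKNUM conv=notrunc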

Solution 2:

ThoriumBR's answer is good, but I would like to add some advice in case you cannot access the remote server.

You already have one (or more) bad download(s) locally.
Using the chunking trick given by ThoriumBR (dd with bs and skip), you can split those bad copies into chunks locally and make use of the good parts.
Compare each of those chunks with the same chunk downloaded using curl (as per ThoriumBR's last instruction). If you find two identical copies of a chunk (compare them byte for byte with a binary diff, no need for slow md5), you can be relatively certain it is a good chunk, so save it somewhere else and repeat with the next chunk.

So, for each chunk: compare your local copies (if you have more than one), add freshly downloaded copies, and keep comparing until you find two identical copies; that is the one to keep.
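
A minimal sketch of that comparison, assuming two bad downloads named download1.tar and download2.tar (placeholder names) and the same 512 MiB chunk size as above:

# Extract the same chunk (here chunk 1) from both downloads, then compare byte for byte.
dd if=download1.tar bs=536870912 skip=1 count=1 2> /dev/null > chunk1.a
dd if=download2.tar bs=536870912 skip=1 count=1 2> /dev/null > chunk1.b
if cmp -s chunk1.a chunk1.b; then
    echo "chunk 1 is identical in both downloads, keep it"
fi

If the two copies differ, fetch another copy of that chunk with curl -r (as in Solution 1) and compare again.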

It is a fair bit of manual work, but doable. You can even script the whole process, but doing that (and debugging the script) may not be worth the effort.