Creating a tar file with checksums included
Here's my problem: I need to archive a lot of data (up to 60 TB) consisting of big files (usually 30 to 40 GB each) into tar files. I would like to make checksums (MD5, SHA1, whatever) of these files before archiving; however, not reading every file twice (once for checksumming, once for tar'ing) is more or less a necessity to achieve a very high archiving performance (LTO-4 wants 120 MB/s sustained, and the backup window is limited).
So I'd need some way to read a file, feeding a checksumming tool on one side, and building a tar to tape on the other side, something along the lines of:
tar cf - files | tee tarfile.tar | md5sum -
Except that I don't want the checksum of the whole archive (which is what this sample shell code gives me), but a checksum for each individual file in the archive.
I've studied GNU tar, Pax, and Star options. I've looked at the source of Archive::Tar. I see no obvious way to achieve this. It looks like I'll have to hand-build something in C or similar to achieve what I need. Perl/Python/etc. simply won't cut it performance-wise, and the various tar programs miss the necessary "plugin architecture". Does anyone know of an existing solution before I start code-churning?
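To make the requirement concrete, here is roughly the kind of single-pass hook I have in mind: a sketch of my own (not taken from any of the tools above) that wraps each file in a hashing reader and hands it to Python's tarfile, so the checksum is computed during the one pass tar makes over the data. Whether such an approach could sustain 120 MB/s is exactly what I doubt:
import hashlib
import tarfile

class HashingReader:
    ''' File-like wrapper that updates a digest with every block tarfile reads. '''
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.digest = hashlib.md5()

    def read(self, size=-1):
        data = self.fileobj.read(size)
        self.digest.update(data)
        return data

def add_with_checksum(tar, filename):
    ''' Add one file to an open tar archive; return its MD5, computed in the same pass. '''
    info = tar.gettarinfo(filename)
    with open(filename, "rb") as f:
        reader = HashingReader(f)
        tar.addfile(info, fileobj=reader)
    return reader.digest.hexdigest()

# Usage sketch (device name hypothetical):
# tar = tarfile.open("/dev/nst0", "w|")   # "w|" = uncompressed stream mode, suitable for tape
# print(add_with_checksum(tar, "bigfile.bin"))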
Before going ahead and rewriting tar, you may want to profile the quick-and-easy method of reading the data twice, as it may not be much slower than doing it in one pass.
The two-pass method is implemented here:
http://www.g-loaded.eu/2007/12/01/veritar-verify-checksums-of-files-within-a-tar-archive/
with the one-liner:
tar -cvpf mybackup.tar myfiles/ | xargs -I '{}' sh -c "test -f '{}' && md5sum '{}'" | tee mybackup.md5
While it's true that md5sum is reading each file from disk in parallel with tar, instead of getting the data streamed through the pipe, the Linux disk cache should make this second read a simple read from a memory buffer, which shouldn't really be slower than a stdin read. You just need to make sure you have enough space in your disk cache to store enough of each file that the second reader is always reading from the cache and not falling far enough behind to have to retrieve from disk.
Here's an example Python script. It calculates the checksum of each file just before that file is added to the archive, so the second read (by tar) should be served from the disk cache, as described above. At the end of the script, the checksum file itself is added to the archive.
import hashlib
import os
import tarfile

def md5(filename):
    ''' Return the MD5 hex digest of a file, reading it in chunks
        so that multi-gigabyte files do not exhaust memory. '''
    d = hashlib.md5()
    try:
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                d.update(chunk)
    except OSError as e:
        print(e)
    else:
        return d.hexdigest()

root = "/home"
path = os.path.join(root, "path1")              # the tree to archive
outtar = os.path.join(root, "output.tar")       # keep the archive outside the tree being walked
chksum_file = os.path.join(root, "chksum.txt")  # likewise for the checksum file

tar = tarfile.open(outtar, "w")
o_chksum = open(chksum_file, "w")

for r, d, f in os.walk(path):
    for name in f:
        filename = os.path.join(r, name)
        digest = "%s:%s" % (md5(filename), filename)
        o_chksum.write(digest + "\n")
        tar.add(filename)

o_chksum.close()     # flush the checksums before archiving the file itself
tar.add(chksum_file)
tar.close()
When you untar, use the chksum_file to verify the checksums.
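For example, a minimal verification sketch (my own, assuming the archive has been extracted and that chksum.txt contains the "digest:path" lines written by the script above; since the stored paths are absolute, they may need adjusting to wherever you actually extracted):
import hashlib

with open("chksum.txt") as listing:
    for line in listing:
        expected, filename = line.rstrip("\n").split(":", 1)
        d = hashlib.md5()
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                d.update(chunk)
        print(filename, "OK" if d.hexdigest() == expected else "MISMATCH")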
I think your problem is down to a design issue of tar: tar has no table of contents, so it does not allow random access/positioning inside the archive file, and everything has to be handled as a sequential file stream rather than through buffer-level access.
So you may want to look at different formats like PAX or DAR, which allow random access.