What is the fastest way to create a checksum for large files in C#?
Solution 1:
The problem here is that SHA256Managed
reads 4096 bytes at a time (inherit from FileStream
and override Read(byte[], int, int)
to see how much it reads from the file stream), which is too small a buffer for disk I/O.
To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap the FileStream
in a BufferedStream
and set a reasonably large buffer size (I tried a ~1 MB buffer):
// Disposing the BufferedStream also disposes the underlying FileStream,
// so wrapping it in a using block is safe.
using (var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
using (var sha = SHA256.Create())
{
    // The rest remains the same: hash the (now buffered) stream.
    byte[] checksum = sha.ComputeHash(stream);
}
Solution 2:
Don't checksum the entire file; create a checksum every 100 MB or so, so that each file has a collection of checksums.
Then, when comparing checksums, you can stop after the first differing checksum, getting out early and saving you from processing the entire file.
It will still take the full time for identical files.
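A minimal sketch of this idea, assuming SHA256 for the per-chunk hashes and a lazy comparison so reading stops at the first mismatching chunk (ChunkChecksums and FilesMatch are illustrative names, not an existing API):
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class ChunkedChecksums
{
    // Yields one checksum per chunk; 100 MB chunks by default, as suggested above.
    public static IEnumerable<byte[]> ChunkChecksums(string filePath, int chunkSize = 100 * 1024 * 1024)
    {
        var buffer = new byte[chunkSize];
        using (var stream = File.OpenRead(filePath))
        using (var sha = SHA256.Create())
        {
            while (true)
            {
                // Fill the buffer completely so chunk boundaries are deterministic.
                int filled = 0, read;
                while (filled < buffer.Length &&
                       (read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
                {
                    filled += read;
                }
                if (filled == 0) yield break; // end of file
                yield return sha.ComputeHash(buffer, 0, filled);
            }
        }
    }

    // Compares chunk by chunk and bails out at the first difference,
    // so neither file is read past the first mismatching chunk.
    public static bool FilesMatch(string fileA, string fileB)
    {
        using (var a = ChunkChecksums(fileA).GetEnumerator())
        using (var b = ChunkChecksums(fileB).GetEnumerator())
        {
            while (true)
            {
                bool hasA = a.MoveNext(), hasB = b.MoveNext();
                if (hasA != hasB) return false;  // different number of chunks
                if (!hasA) return true;          // both exhausted: identical
                if (!a.Current.SequenceEqual(b.Current)) return false; // early exit
            }
        }
    }
}
In practice you would also compare the file lengths first, which costs nothing, before hashing anything at all.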
Solution 3:
As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:
new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)
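A minimal sketch of feeding such a stream straight into a hash, with no BufferedStream in between (file stands for your path variable; MD5 here is interchangeable with SHA256, and requires using System.IO and System.Security.Cryptography):
using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                   FileShare.ReadWrite, 16 * 1024 * 1024))
using (var md5 = MD5.Create())
{
    byte[] checksum = md5.ComputeHash(stream);
}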
Note that Brad Abrams from Microsoft wrote in 2004:
there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream’s buffering logic into FileStream about 4 years ago to encourage better default performance
source
Solution 4:
Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file).
// Requires: using System.Diagnostics;
public static string Md5SumByProcess(string file) {
    using (var p = new Process()) {
        p.StartInfo.FileName = "md5sum.exe";
        // Quote the path so file names containing spaces survive.
        p.StartInfo.Arguments = "\"" + file + "\"";
        p.StartInfo.UseShellExecute = false;
        p.StartInfo.RedirectStandardOutput = true;
        p.Start();
        // Read the output before waiting so the redirected pipe cannot fill and deadlock.
        string output = p.StandardOutput.ReadToEnd();
        p.WaitForExit();
        // GNU md5sum starts the line with '\' when the file name contains
        // backslashes (typical for Windows paths); Substring(1) strips that marker.
        return output.Split(' ')[0].Substring(1).ToUpper();
    }
}
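A usage sketch, assuming md5sum.exe is on the PATH or next to your executable (the path below is hypothetical):
string hash = Md5SumByProcess(@"C:\data\huge.bin");
Console.WriteLine(hash); // prints the uppercase MD5 hex digest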