Windows Server 2008 Software RAID 5 - Data integrity issues

Since none of the operations you are performing actually fails, running chkdsk /R probably would not yield any results: chkdsk cannot recover anything that is not detected as corrupt.

Data corruption like what you are seeing has a few possible sources:

  1. bit flips in software RAID algorithm execution before writing data
  2. bit flips in hardware implementation when writing data
  3. bit flips on magnetic media
  4. bit flips in hardware implementation when reading data

Take a methodical approach and rule out the ones you can:

  • Number 4 - bit flips when reading - should be quite easy to recognize: the flips would hit different areas of the data on each read, so md5 or sha1 hashes computed repeatedly over the same large file would differ from one run to the next.

  • Number 3 - flips on magnetic media - are rather unlikely to go undetected, since every hard disk applies forward error correction and error detection checksums to its sectors. You would see unrecoverable sector read errors orders of magnitude more often than bit flips slipping through, so a look at the SMART counters for unrecoverable read errors should be sufficient to exclude this one.

  • Number 2 - this one can be quite hard to detect. The SATA protocol protects transmitted data with CRC checksums, so the logic of case 3 applies to the link itself: a faulty link would slow transfers to a crawl through retries long before it let a sector with a flipped bit through. The corruption might happen somewhere else and go undetected, though - in buffers, for example.

  • Number 1 - I would regard this as the most likely case. Either a bug in the implementation or (more likely, as a bug of this significance would probably have been noticed and documented elsewhere 4 years after the OS release) a hardware failure like defective RAM could cause this kind of bit flip. Run a couple of memtest passes to exclude the RAM, especially if you are not using ECC memory. Re-run your tests in a similar environment with the same software configuration (preferably an image of your system) to exclude a software-based cause.
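The check for number 4 can be scripted. What follows is only a minimal sketch in Python, not a tested diagnostic tool: the function names and the example path are made up for illustration. It hashes the same file several times and reports whether all digests agree. One assumption to keep in mind: the file should be considerably larger than the machine's RAM, otherwise later passes may be served from the operating system's file cache instead of the disks.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Hash a file in chunks so that large files do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_digests(path, passes=3, algorithm="sha1"):
    """Read and hash the same file several times, returning one digest per pass."""
    return [file_digest(path, algorithm) for _ in range(passes)]


def reads_are_stable(path, passes=3, algorithm="sha1"):
    """True if every pass produced the same digest, False if any pass differed."""
    return len(set(repeated_digests(path, passes, algorithm))) == 1
```

Calling something like reads_are_stable(r"D:\testdata\big.iso", passes=5) (a hypothetical path) and getting False would point towards flips on the read path. Getting True does not prove the data on disk is correct, only that it is read back consistently.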

You might also extend your tests to copying 15 GB worth of smaller files, just to see whether the corruption affects one of them after a certain amount of data has been written. If it does (which appears likely given your description), you should assume that similar corruption has already happened to data stored on your disks. Compare larger files against the original data or known-good cryptographic hashes to estimate the degree of corruption.
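A comparison against the original data could look roughly like the sketch below. It assumes you still have a trusted copy of the files somewhere off the array; the function names and directory layout are my own and purely illustrative. It walks the trusted tree, compares every file with its counterpart on the array chunk by chunk, and counts the differing bytes per file, which gives a rough idea of the degree of corruption.

```python
from pathlib import Path


def differing_bytes(original, copy, chunk_size=1024 * 1024):
    """Count the byte positions at which two files differ.

    A difference in length is counted as that many differing bytes.
    """
    count = 0
    with open(original, "rb") as a, open(copy, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break
            if chunk_a != chunk_b:
                count += sum(x != y for x, y in zip(chunk_a, chunk_b))
                count += abs(len(chunk_a) - len(chunk_b))
    return count


def compare_trees(source_dir, copy_dir):
    """Map each damaged file (relative path) to its number of differing bytes.

    Files missing from the copy are reported with the value None.
    Files that match exactly are left out of the report.
    """
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    report = {}
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        if not dst.is_file():
            report[rel.as_posix()] = None
            continue
        diff = differing_bytes(src, dst)
        if diff:
            report[rel.as_posix()] = diff
    return report
```

An empty report after copying the 15 GB test set would suggest the corruption is tied to large files; a handful of entries with one or two differing bytes each would fit the bit flip theory.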

It would also have been nice to recalculate the XOR parity and compare it to the parity data stored on the disks. Most RAID 5 systems offer this functionality, which is typically called "scrubbing". With Windows, there seems to be no way to do this out of the box; I was only able to find data recovery services that do it for you.
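To illustrate what a scrub checks, here is a small in-memory sketch of the XOR parity idea. It is not a tool for Windows dynamic disks and knows nothing about their on-disk layout; the function names and the representation of a stripe as a list of byte blocks plus one parity block are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """True if the stored parity equals the XOR of the stripe's data blocks."""
    return xor_blocks(data_blocks) == parity_block


def scrub(stripes):
    """Return the indices of stripes whose stored parity does not match.

    `stripes` is an iterable of (data_blocks, parity_block) pairs.
    """
    return [
        index
        for index, (data_blocks, parity_block) in enumerate(stripes)
        if not stripe_is_consistent(data_blocks, parity_block)
    ]
```

Note that a mismatch only tells you that a stripe is inconsistent. With a single parity block there is no way to tell from the parity alone which of the blocks holds the flipped bit.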

Good hunting.