Single-bit error on RAID device?

Solution 1:

One possibility is a random bit flip in RAM or in the controller during the read in step 4. If the data was corrupted on read, you would see it in step 4, and if the corrupt copy was still cached you would see it again in step 6 when comparing files, since the comparison might be served from the cache rather than from disk.

To test this case, power cycle all of your hardware to make sure the caches are cleared, then open the file and run the comparison with the backup again. If all is well, this was the problem. There's no way to know at what stage of the read the bit flip occurred, so you'll just have to chalk it up as an unsolved mystery.
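If you want the comparison to tell you more than "the files differ", here is a minimal Python sketch that lists every differing byte and checks whether the damage is exactly one flipped bit. The function names `diff_bits` and `is_single_bit_flip` are my own, not from any tool. Note that it reads through the normal OS cache, so it only tells you what is on disk if you run it after the power cycle.

```python
def diff_bits(path_a, path_b, chunk_size=1 << 20):
    """Return a list of (byte_offset, xor_mask) for every byte that differs.

    xor_mask has a 1 in each bit position where the two files disagree.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            if len(a) != len(b):
                raise ValueError("files differ in length")
            if a != b:  # only scan byte by byte when the chunks differ
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x ^ y))
            offset += len(a)
    return diffs


def is_single_bit_flip(diffs):
    """True if exactly one byte differs and exactly one bit in it is flipped."""
    return len(diffs) == 1 and bin(diffs[0][1]).count("1") == 1
```

Call `diff_bits("file_on_raid", "file_from_backup")` with your own two paths (those names are placeholders). An empty list after the power cycle means the copy on disk is fine and the flip happened on read.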

Failing that, a second, even unluckier possibility is a random bit flip on write in step 1, either in RAM or (more likely, based on your description) on the RAID controller. In that case you were working from a good cached copy in steps 2 and 3 even though a corrupt copy already existed on disk. A week later, when you accessed the data again, it was re-read from disk and you got the corrupt data that had been written originally. This makes a lot of assumptions and relies on a bit of bad luck. If this is the case, you'll just have to restore the file from backup and move on.

Those are the only two things I can think of, really. It doesn't sound like an issue with the drives themselves. In any case, since there's no way to tell where in the hardware the error occurred, I recommend running a full memory diagnostic just to be safe, although more likely the cause was unfortunate EMI or cosmic rays. As Canadian Luke mentioned in his answer, ECC RAM, if your motherboard supports it, will protect against this type of event, at least on the RAM side. Bit flips like this are actually not uncommon at all.


The first case (a bit flip on read) ended up being the OP's problem, rather than the second possibility.

Solution 2:

Check your RAM. File systems like ZFS recommend ECC (error-correcting code) memory to prevent issues like this.

Your RAID controller simply wrote the information it was given and assumed that information was correct. It doesn't check that the data it receives is correct, only that it was written properly.
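Since the controller won't catch this for you, one workaround is to keep your own checksum next to the file and verify it later. Here is a minimal Python sketch of that idea using a SHA-256 sidecar file. The helper names and the `.sha256` suffix are my own choices, not part of any tool. It also has an obvious limit: if the data is already corrupt in RAM before it gets hashed, the checksum will happily match the corrupt data.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_with_checksum(path, data):
    """Write data and store its SHA-256 in a sidecar file (path + '.sha256')."""
    # Hash the in-memory bytes, not the file, so a bad write can be detected.
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(digest)
    return digest


def verify(path):
    """Re-read the file and compare it against the stored checksum."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return sha256_of(path) == expected
```

Running `verify` right after the write may just read back the cached copy, so it's more meaningful after a reboot or some time later.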

Your motherboard may not support ECC memory, but if it does, ECC should prevent this kind of issue. It's also possible that a solar flare flipped that bit on you; the more RAM you have, the greater the chance of an anomaly like this happening.

I'm out of town on my cell, but I can cite sources on Monday