File corruption when replacing a file on RAID 1

I have a RAID 1 configuration of two 1TB drives on a Fedora 12 box. Most of what is stored there are video files that are numerically labeled. The problem I'm having is that one of the video files got corrupted. I copied a replacement from a backup over the bad file, and now it works fine. However, after doing this, the next numbered file shrinks from 350MB to 200KB and all but about half a second of video disappears. If I then replace that file, the same thing happens to the next one down the line.

Ex:

Replace the corrupt 1.avi, and 2.avi shrinks to 200KB.
Replace the now-corrupted 2.avi, and it works, but 3.avi gets screwed up.

I have run SMART tests on the drives and they report fine. Does anyone have any tests I can run to try to figure out what is going on?

EDIT: It is a two-disk software RAID 1 with an ext4 filesystem.


Solution 1:

I don't know what tests you're looking for that will tell you anything you don't already know.

The filesystem's corrupt.

The easiest solution is going to be to copy the data off to a different system (with a working filesystem), verify it, and then blow away the RAID on the existing system. After reformatting the drives individually and rebuilding the RAID, you should be good to go on the existing system again.
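Since you're on Linux software RAID, the blow-away-and-rebuild step would look roughly like the sketch below. The device names (/dev/md0 for the array, /dev/sda1 and /dev/sdb1 for its members) and the mount point are just assumptions; substitute whatever your setup actually uses.

    rsync -a /mnt/raid/ user@othermachine:/backup/raid/   # copy the data off first and verify it
    umount /mnt/raid
    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sda1 /dev/sdb1            # wipe the old RAID metadata
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0                                     # fresh filesystem on the new array
    mount /dev/md0 /mnt/raid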

Had the same thing happen to me on a Server 2003 system that was using Server 2003's software RAID. A power failure or a system crash happened during a write to the array and the disks got out of sync, so the filesystem exhibited the same symptoms you're describing. (And likewise, all the tests I ran lied to me and said the disks and array were perfectly fine, even though they obviously weren't.) Anything copied after a certain point on the array would get corrupted. The data would be valid for the first ~500KB of the file and junk after that: images would display the top portion fine and then be white at the bottom, documents would contain a few pages or rows of valid data and then gibberish, etc. And if I added files "before" the corrupt point, the corruption would seem to move on to the "next" file, as if the corruption was offset ~500KB from a fixed point on the array.

If you can isolate which disk in the array contains the filesystem corruption, you may be able to correct the issue by pulling out the disk with the corrupt data and forcing the array to rebuild from the good disk. (Assuming the corruption's only present on one disk, as it was for me.) That worked for me, and was how I recovered the data on my corrupt array. With just a two-disk mirror, you could even force a rebuild from each disk in turn and see which resulting array works and which ends up corrupt, as in the sketch below.
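With mdadm, that test is roughly the following. It's only a sketch, and it assumes the array is /dev/md0, the suspect disk is /dev/sdb1 and the good one is /dev/sda1; swap the names to match your layout.

    mdadm /dev/md0 --fail /dev/sdb1      # drop the suspect disk out of the mirror
    mdadm /dev/md0 --remove /dev/sdb1
    # check your files while running degraded on the remaining (good) disk;
    # if they look right, add the other disk back and let it resync from the good copy
    mdadm /dev/md0 --add /dev/sdb1
    cat /proc/mdstat                     # watch the rebuild progress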

I still didn't trust it afterwards, so I copied the data off, nuked the array, reformatted the disks, reinstalled the OS, and warned everyone that I'd kick them in their fun bits if I found them using software RAID on my network again. I'd recommend you do the same. Well, regarding the data, at least. Whether or not you want to abandon Linux software RAID and threaten your users is more of a personal preference.

Solution 2:

Sounds like you have bad filesystem troubles. Unmount the filesystem and run fsck -f on it to see what it finds. The -f flag tells fsck to run even though the filesystem appears to be clean.
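Assuming the array device is /dev/md0 and it's mounted at /mnt/raid (adjust both to your setup), that would be something like:

    umount /mnt/raid
    fsck.ext4 -f /dev/md0      # force a full check even though the fs is marked clean
    # run it with -n first if you just want a read-only report of the damage:
    # fsck.ext4 -fn /dev/md0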

Solution 3:

You might try Theodore Ts'o's debugfs.

http://linux.die.net/man/8/debugfs

You can use it to interactively debug your filesystem, for example to see which inodes correspond to which files and so on.
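A quick sketch of what a session looks like; /dev/md0, the file path and the inode/block numbers are placeholders for your own. debugfs opens the filesystem read-only by default, so it's safe to poke around:

    debugfs /dev/md0
    debugfs:  stat /videos/2.avi    # show the inode, size and allocated blocks for a file
    debugfs:  ncheck 12345          # map inode 12345 back to a path name
    debugfs:  icheck 678901         # find which inode owns block 678901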