Is it safe to mark a disk as OK in a degraded RAID 5 array?
Intel Matrix Storage Console 8.9 showed a degraded array with one disk failure, yet it offered the option to mark the disk as OK and rebuild the array. When would it be appropriate to do this? Does it sometimes assess disk failure incorrectly? Why offer this option at all?
This is a test server and I have backups, so I'm not terribly concerned. I tried marking the disk as OK, and it rebuilt the volume without indicating any further problem.
BUT is there a problem anyway?
Additionally...
The great responses make me wonder what the best methods to test the disk might be. SMART tests are mentioned below. I will probably remove the drive and rebuild with a new one.
It still seems unclear to me whether a volume can rebuild without showing errors, as appears to have already happened with this existing drive.
Drives can be marked as failed in an array for many reasons. Maybe there are a few defective sectors. Maybe the drive heads are failing. Maybe cosmic rays hit your drive at just the right angle and time to fail a scan. Maybe the firmware has a bug that trips under certain conditions.
Some of these are reparable failures, some aren't.
The thing is, it's really hard to predict hard drive failures. Google's well-known disk failure paper found that SMART was useful only in the sense that drives which reported SMART errors were more likely to fail than drives that didn't. Fully 36% of the failed drives had no SMART errors at all. So you could run a full suite of SMART scans, find nothing, and know no more than you do now.
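If you do want to run that suite of scans anyway, here's a minimal sketch of one way to do it. It assumes a Linux host with smartmontools installed and uses `/dev/sdX` as a placeholder for the suspect disk; behind a RAID controller, smartctl's `-d` option may be needed to reach the physical drive.

```python
#!/usr/bin/env python3
"""Rough sketch of 'a full suite of SMART scans' using smartctl.

Assumes a Linux host with smartmontools installed; /dev/sdX is a
placeholder for whichever device node actually holds the suspect disk.
"""
import subprocess

DEVICE = "/dev/sdX"  # placeholder -- substitute the real device node

# Overall health verdict, vendor attributes, and the drive's error log.
for flags in (["-H"], ["-A"], ["-l", "error"]):
    subprocess.run(["smartctl", *flags, DEVICE], check=False)

# Queue the extended self-test; it runs on the drive itself, and the
# results show up later under `smartctl -l selftest`.
subprocess.run(["smartctl", "-t", "long", DEVICE], check=False)
```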
But, assuming this was an out-of-the-blue failure and not an I-did-something-funny-and-it-failed failure, you already have an indication of problems with the disk. Now it's a question of value (a rough tally is sketched after this list):
- How much does another drive cost?
- How much time would be lost for its users if this server died?
- How much of your time would be lost if this server died?
- How much is all that time worth?
- Double this value to account (naively) for opportunity cost
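To make that concrete, here's a back-of-the-envelope version of the tally. Every number is made up for illustration, so plug in your own:

```python
# All figures below are hypothetical -- substitute your own.
drive_cost = 150.0            # price of a replacement drive
user_hours_lost = 10 * 8      # e.g. 10 users idle for 8 hours if the server dies
admin_hours_lost = 8          # your time spent rebuilding and restoring from backup
hourly_rate = 50.0            # rough blended value of an hour of that time

downtime_cost = (user_hours_lost + admin_hours_lost) * hourly_rate
downtime_cost *= 2            # naive doubling for opportunity cost, as above

print(f"Replacement drive:      ${drive_cost:,.2f}")
print(f"Cost if the server dies: ${downtime_cost:,.2f}")
```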
I've never been in a situation where keeping a suspect drive in service was worth it. Why go through the pain? Chances are, the drive you need is pretty cheap. Just buy it and move on.
I once had a faulty caddy on one of the 14 disks in an old U160 SCSI array. When I replaced the caddy (the disk itself was fine), the controller still considered the disk failed because it had the same serial number.
So I marked it as OK, the array rebuilt, and all was fine until we decommissioned it.
It all depends on your situation, but normally I would never mark a disk as OK unless I was 100% certain that it was OK. Even at 99.9% certain, I would delete the array and start again.
If you care about the data, replace the drive immediately with a new one and rebuild the array. You can then run extensive testing on the removed drive and requalify it for use if it passes. If you instead try to rebuild onto the failed drive in place, you extend the window in which you are vulnerable to a second drive failure should something go wrong during or after the rebuild.
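As a rough outline of what that "extensive testing" might look like on the pulled drive, here's a sketch assuming a Linux scratch machine with smartmontools and badblocks available, with `/dev/sdX` as a placeholder for the removed disk. Note that `badblocks -w` is destructive, so point it only at the pulled drive, never at a live array member.

```python
#!/usr/bin/env python3
"""Sketch of requalifying a pulled drive on a scratch machine.

Assumes Linux with smartmontools and badblocks installed; /dev/sdX is a
placeholder for the removed drive. The write test below erases the disk.
"""
import subprocess

DEVICE = "/dev/sdX"  # placeholder -- the removed drive only

# Four-pattern destructive write/read pass over the whole surface.
subprocess.run(["badblocks", "-wsv", DEVICE], check=False)

# Long SMART self-test (runs on the drive itself; poll it later
# with `smartctl -l selftest`).
subprocess.run(["smartctl", "-t", "long", DEVICE], check=False)

# Afterwards, the attributes that matter most should still read zero.
out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout
for line in out.splitlines():
    if any(key in line for key in ("Reallocated_Sector_Ct",
                                   "Current_Pending_Sector",
                                   "Offline_Uncorrectable")):
        print(line)
```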