Recover from a Punctured RAID array

Here is my situation.

I have a Dell server with a Dell PERC 7i controller (LSI controller).

I had a drive give me a Failure Predicted warning, so I called Dell support. They came out and replaced the drive, and the array rebuilt itself. Pretty standard.

Two weeks later, another drive gave me the Failure Predicted warning. I figured maybe it was a bad batch of drives, or a coincidence. So I contacted support and looked more in depth. I realized that there were bad blocks on one of the other drives that hadn't failed, and those bad blocks were copied over during the rebuild. So now I have bad blocks all over the place and they are slowly killing my array. I have come to find out that this is called a punctured array.

Their advice was to replace all the drives, rebuild the array, and restore from backup. Except I've been having this issue for a few weeks, which means my backups are bad... and if I restore from an earlier backup (a month ago), I will be missing about 4 weeks' worth of data from my database, which is totally unacceptable for our office.

My question is: has anyone ever recovered from something like this without losing data, or without the whole "throw it all out the window and start over" approach?

I did find one link that covers my scenario, though I'm not sure it sheds any light on the situation: http://www.theprojectbot.com/raid/what-is-a-punctured-raid-array/

Any help or direction would be appreciated! What do you guys think?


Solution 1:

I assume your system is still up, so the best thing to do is make an immediate backup, discard the disks/array, rebuild, and restore from that backup.

Bad blocks don't always mean your backups are also bad. If you haven't experienced any performance problems or damaged files, then your backups should still be complete enough to finish a restore.

To test, take your most recent backup and examine your most important data. If it's still intact, you likely have a good backup.
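
As a rough first pass before examining the data by hand, something like the following Python sketch could be run against a restored copy of the backup. It assumes the backup has already been restored or mounted to an ordinary directory; the function name `scan_backup` is made up for illustration. It only confirms that every file can be read end to end and records a SHA-256 checksum for each one. It cannot tell you whether the contents are logically correct, so the database still needs its own consistency check.

```python
import hashlib
import os


def scan_backup(root):
    """Read every file under root and return (checksums, unreadable).

    checksums:  dict mapping file path -> SHA-256 hex digest
    unreadable: list of (file path, error message) for files that
                raised an I/O error while being opened or read
    """
    checksums = {}
    unreadable = []
    # Note: os.walk silently skips directories it cannot list.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            try:
                with open(path, "rb") as handle:
                    # Read in 1 MiB chunks so large files don't fill memory.
                    for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                        digest.update(chunk)
            except OSError as exc:
                unreadable.append((path, str(exc)))
                continue
            checksums[path] = digest.hexdigest()
    return checksums, unreadable
```

Calling `scan_backup` on the restored directory and finding an empty `unreadable` list is a reasonable sign that the backup media itself is readable. Comparing the checksums of two backups taken on different dates can also show which files changed between them.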

At this point, there is a risk involved as you cannot be 100% certain that your backups are good or that backing up now won't cause file loss. However, your array will eventually fail and force a restore anyway, so this is your only real option.

Solution 2:

Right this instant, do the following:

  • Stop rotating backups or deleting old ones for this system. You want to keep all of the backups you currently have.
  • Take a full backup of the server.

Hopefully the disks are still good enough that your data is intact, and you won't encounter any problems running the new full backup.

Then scrap those disks, and build a new RAID array. Once that's ready, try to restore from the backup you took just now. With any luck, that'll be all you need to do.

If that fails, try the next oldest, and the next oldest, etc. Be sure to test the functionality of the system - just because it boots doesn't mean it's fully operational. In particular, test the databases for corruption.

If you had to restore the entire system from an older backup, that's OK. Take the newest backups and restore just the database files and other important files on top of it. Test them to make sure they work properly. Again, if that fails, try the next oldest.

Using this process minimizes the data loss.
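
The "newest first, then fall back" process above can be sketched in Python roughly as follows. Everything here is hypothetical: `restore_newest_good`, the `(date, backup_id)` pair format, and the `restore` and `verify` callables stand in for whatever your backup software and database checks actually provide. Dates are assumed to be ISO-style strings (YYYY-MM-DD) so that plain sorting puts them in chronological order.

```python
def restore_newest_good(backups, restore, verify):
    """Try backups newest first; return the id of the first good one.

    backups: iterable of (date, backup_id) pairs (hypothetical format)
    restore: callable taking a backup_id; raises an exception on failure
    verify:  callable taking a backup_id; returns True if the restored
             data passes your checks (e.g. a database consistency check)
    Returns None if no backup both restores and verifies.
    """
    # Sorting in reverse puts the most recent date first.
    for _date, backup_id in sorted(backups, reverse=True):
        try:
            restore(backup_id)
        except Exception as exc:
            print(f"restore of {backup_id} failed: {exc}")
            continue
        if verify(backup_id):
            return backup_id
        print(f"{backup_id} restored but failed verification")
    return None
```

The same loop applies twice: once for the full system restore, and again for just the database files and other important files, which is what keeps the amount of lost data as small as possible.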