How to recover from a drive failure in a RAID 5 configuration?

This morning a drive failed on our database server. The drive array (3 disks) is set up in a RAID 5 configuration.

While we wait for a replacement drive we are preparing a recovery strategy. Users are continuing to work on the system, albeit very slowly (we don't know why).

How does one install the new drive - will the data for this drive automatically be rebuilt from the parity or is there another process we should follow?

Edit: This is a hardware RAID controller. (Thanks for the answers so far, appreciated)


Solution 1:

The system is running very slowly because every read that touches the failed disk has to reconstruct the missing data from the surviving disks and parity, which involves additional CPU and I/O.
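To make the reconstruction concrete, here is a minimal sketch of how single-parity RAID recovers a missing block with XOR. The block contents and the `xor_blocks` helper are made up for illustration; a real controller does this per stripe in firmware, and its on-disk layout will differ.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-disk array: two data blocks plus parity.
data_a = b"ACCOUNTS"
data_b = b"INVOICES"
parity = xor_blocks([data_a, data_b])

# The disk holding data_b has failed. Every read of that block must be
# recomputed from all surviving blocks in the stripe.
rebuilt_b = xor_blocks([data_a, parity])
assert rebuilt_b == data_b
```

The same arithmetic shows why a second failure is fatal: with only one block of the stripe left, there is nothing to XOR it against.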

If you have a missing disk in a RAID-5 configuration you have no recovery strategy. If another disk goes down you will lose your data. Run, don't walk, to the nearest vendor from which you can get a compatible part covered by manufacturer's warranty shipped by a same-day urgent courier. If the vendor you bought the array from is already in the process of getting the part, get both parts and stash the other one away as a spare.

If you have a RAID-5 being used for a production system you should consider leaving a spare disk in the array as a hot spare.

Added - If your logs are not on a separate volume (physically separate disks), move them to a separate set of disks, even just a single mirrored pair. This will also be a performance win if your database has any significant load, as contention on log volumes has a disproportionately bad effect on performance.

If this is possible you can also make your database more robust by doing the following:

  1. Shut down the database.
  2. Back up the database.
  3. Move the logs to a physically separate set of disks (make sure you reconfigure the database so it knows where the logs have been moved to).
  4. Restart the database and application.

If you have the logs on a separate volume you can restore and roll forward from the backup if and only if a disk failure does not compromise the logs. Database logs should be on a separate disk volume for (amongst others) the following reasons:

  • Log usage patterns are predominantly sequential, appending log entries onto the end of the file (the file is in effect a ring buffer). This means that a large number of log entries can be written out quickly as there is little disk head seek activity.

  • If they are sharing physical disks with a heavily random access workload (e.g. transactional tables and indexes) they will be slowed down disproportionately as the head seek activity disrupts the sequential writes.

  • Having the logs on a separate volume is almost always a performance win and only needs a single mirrored pair for logs to support quite a heavy workload. This means that the hardware to do it is quite cheap, so there is a small cost for a big performance and reliability win.

  • If your data array goes down the logs are not lost. If you have a proper backup strategy you can restore from the backup and roll forward from the logs. This means that a whole array can go down on the server without being a single point of failure. Both the log and data arrays have to fail simultaneously to cause data loss.
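As a rough illustration of "restore from the backup and roll forward from the logs", here is a toy model in which the database is a dictionary and each log record is a `(sequence_number, key, value)` tuple. The record format and the `roll_forward` name are assumptions made for this sketch, not how any particular database stores its logs.

```python
def roll_forward(backup, log, backup_lsn):
    """Replay log records newer than the backup onto a restored copy."""
    restored = dict(backup)  # start from the restored backup
    for lsn, key, value in log:
        if lsn > backup_lsn:  # skip changes the backup already contains
            restored[key] = value
    return restored

# The backup was taken after log record 2. The data array then failed,
# but the log volume (on separate disks) survived with records 3 and 4.
backup = {"balance:alice": 100, "balance:bob": 50}
log = [
    (1, "balance:alice", 100),
    (2, "balance:bob", 50),
    (3, "balance:alice", 70),
    (4, "balance:bob", 80),
]

recovered = roll_forward(backup, log, backup_lsn=2)
assert recovered == {"balance:alice": 70, "balance:bob": 80}
```

If the log had been on the failed array as well, records 3 and 4 would be gone and the best you could do is the state in the backup.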

Solution 2:

1) Back up.

Right now no data has been lost. If your backups are not up to date, back up now.

2) Read the manual, call the vendor etc.

Different RAID systems have different steps for replacing a disk, and done wrong you risk destroying the whole array. Without knowing what sort of RAID hardware/software you have we can only guess at the steps needed.

Also, the slow performance is because RAID 5 in a degraded state (i.e. one disk dead) has horrible read performance. How horrible depends on how the parity is stored and which disk died, but the "good" news is that slow performance with one disk gone is a known issue and not cause for panic.
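For a feel of where the extra reads come from, here is a small sketch that counts physical reads on a degraded array. It assumes parity rotates one disk per stripe (the rotation formula is an assumption; real controllers use various layouts) and it only counts reads, so it says nothing about controller CPU, caching, or write behaviour.

```python
def degraded_reads(num_disks, failed_disk, num_stripes):
    """Count physical reads needed to read every data block once."""
    reads = 0
    for stripe in range(num_stripes):
        # Assumed layout: parity moves to a different disk on each stripe.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        for disk in range(num_disks):
            if disk == parity_disk:
                continue  # the parity block holds no user data
            if disk == failed_disk:
                reads += num_disks - 1  # read every surviving block in the stripe
            else:
                reads += 1  # block is on a healthy disk, read it directly
    return reads

# 3 disks, 3 stripes: 6 data blocks, so a healthy array needs 6 reads.
healthy = 3 * (3 - 1)
degraded = degraded_reads(num_disks=3, failed_disk=0, num_stripes=3)
assert (healthy, degraded) == (6, 8)
```

On a 3-disk array the read count only goes from 6 to 8 in this model, but the penalty per lost block grows with the number of disks, and each reconstructed read also ties up every surviving disk at once.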