How do you check the health of individual hard drives in a RAID array?

Solution 1:

Typically, what you want is a package called smartmontools. It can query the SMART interface on your disks, which is present in most modern drives.

There is also a daemon called smartd that can handle continuous monitoring for you.
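For instance, a minimal /etc/smartd.conf entry might look like the following (the device name and test schedule here are just an illustration; adjust them to your drives):

# monitor all attributes, enable offline testing and attribute autosave,
# run a short self-test daily at 02:00 and a long one every Saturday at 03:00,
# and mail root when something looks wrong
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root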

However, if your system is a home server, checking manually from time to time is often good enough. Like so:

smartctl -a /dev/sda

A lot of data spews forth. The attributes that interest me most are the following:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       13946
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   066   000    Old_age   Always       -       25
194 Temperature_Celsius     0x0022   075   064   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

This gives you a rough, at-a-glance way to judge drive health. When the error rates start going up, it's time to look for a replacement. You can also check that the drives are not running hot.
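In a RAID array you will want to check every member disk, not just one. A quick shell loop makes that less tedious (the device names are just an example; substitute your actual array members):

# print the key error and temperature attributes for each member disk
for d in /dev/sda /dev/sdb; do
    echo "=== $d ==="
    smartctl -A "$d" | grep -E 'Error_Rate|Reallocated|Temperature'
done

Note that if the disks sit behind a hardware RAID controller, smartctl may need a -d option to reach them, e.g. smartctl -a -d megaraid,0 /dev/sda on some LSI/MegaRAID setups.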

Solution 2:

Something like mdadm --query --detail /dev/md0 should work, but when a drive actually fails, root will receive an e-mail (that's the standard configuration on CentOS, and I believe on other distros as well). You can verify the notification by manually failing a member (e.g. mdadm --manage /dev/md0 --fail /dev/sda1), and then you will be 100% sure it works. Just remember to remove and re-add the partition afterwards so the array can rebuild.
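If you would rather not degrade the array just to test the mail path, mdadm's monitor mode can send a test alert instead, assuming MAILADDR is set in /etc/mdadm.conf (or an address is passed with -m):

# send a TestMessage alert for each array found in the config, then exit
mdadm --monitor --scan --oneshot --test

That confirms the notification chain end to end without touching the array itself.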