Why did our RAID array fail?

Solution 1:

Multiple drives failing in quick succession is not as rare as people seem to think. Failures tend to follow what's called a bathtub curve - a high initial rate as manufacturing defects get stressed to failure, dropping to a relatively low rate for the typical lifetime of the drives, and then rising again as components wear out past their design lifetimes. Drives are mechanical, and server drives are running constantly.
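
To make that shape concrete, here is a minimal sketch (my own illustration, not data from any particular drive model) that approximates a bathtub curve by summing Weibull hazard rates for infant mortality, random failures and wear-out. Every shape and scale parameter here is invented purely for illustration.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/lambda) * (t/lambda)^(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t_hours):
    """Illustrative bathtub curve: infant mortality (shape < 1),
    a constant random-failure floor (shape = 1), and wear-out (shape > 1).
    Parameters are made up for illustration, not real drive data."""
    infant = weibull_hazard(t_hours, shape=0.5, scale=50_000)
    random_floor = weibull_hazard(t_hours, shape=1.0, scale=1_000_000)
    wearout = weibull_hazard(t_hours, shape=5.0, scale=60_000)
    return infant + random_floor + wearout

# Hazard rate at a few points in a drive's life (failures per hour):
for years in (0.1, 1, 3, 5, 7):
    t = years * 8760  # hours of continuous operation
    print(f"{years:>4} years: {bathtub_hazard(t):.2e} failures/hour")
```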

When one drive fails, another failure is still only slightly more likely in itself, but the first failure is typically followed by increased stress - often, somewhat paradoxically, caused by the RAID rebuild process, which forces the remaining drives to carry out a lot of intense IO.
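
As a rough back-of-the-envelope illustration of why rebuilds are risky: assuming a spec-sheet unrecoverable read error (URE) rate of 1 in 10^14 bits (a common figure for consumer-grade drives) and some hypothetical array sizes - none of this is specific to your array - the chance of hitting at least one URE while reading the surviving disks grows quickly with capacity.

```python
import math

# Illustrative only: 1e-14 is a typical consumer-drive spec, and the
# sizes below are hypothetical, not the array from the question.
URE_PER_BIT = 1e-14

def p_ure_during_rebuild(bytes_to_read):
    """Probability of at least one URE while reading this much data,
    using a Poisson approximation and assuming independent errors."""
    bits = bytes_to_read * 8
    return 1 - math.exp(-bits * URE_PER_BIT)

for tb in (2, 6, 12):
    bytes_read = tb * 10**12
    print(f"Reading {tb} TB during a rebuild: "
          f"~{p_ure_during_rebuild(bytes_read) * 100:.0f}% chance of a URE")
```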

Finally, SMART does not have a good reputation as a reliable predictor of drive failure; there is some benefit, but overall it's not great. There are some very good long-term study results from Google on this, published as "Failure Trends in a Large Disk Drive Population".
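
If you still want to keep an eye on the attributes that study found most correlated with failure (reallocated and pending sector counts in particular), a sketch along these lines can pull them out of smartmontools output. Note that /dev/sda is a placeholder and the attribute names assume an ATA drive exposing the usual SMART attribute table.

```python
import subprocess

# Attributes commonly associated with impending failure; names as
# reported by smartctl for ATA drives.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable")

# Requires smartmontools to be installed; replace /dev/sda as needed.
out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                     capture_output=True, text=True, check=False).stdout
for line in out.splitlines():
    if any(attr in line for attr in WATCH):
        print(line)
```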

The basic message is that when you run a RAID pack for a long time, you take on an increasing risk that is higher than many expect (the number of reports of multiple drive failures here is testament to that). The second message is that RAID is a tool for increasing availability on average, but you should always make sure you have an acceptable backup strategy in case you are one of the unlucky ones.

Solution 2:

The G3 is pretty old now; I think you're seeing the other side of the MTBF bell curve.

Solution 3:

Have you checked your environmental monitoring records? Any power or cooling events?