From what you describe, the main problem is that they decided to use RAID5 for such a large array, which is a poor choice for this setup, for exactly the reason you are experiencing: a second disk failing during the recovery breaks everything, and with an array this large the rebuild takes long enough that a second failure is all too likely.
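
To put a rough number on "all too likely": below is a back-of-the-envelope sketch of the chance of hitting an unrecoverable read error (URE) while reading the surviving disks during a RAID5 rebuild. The disk count, disk size and URE rate are my own illustrative assumptions, not numbers from your setup.

```python
import math

def rebuild_ure_probability(surviving_disks, disk_size_tb, ure_rate_per_bit=1e-14):
    """Chance of at least one URE while reading every surviving disk in full."""
    bits_to_read = surviving_disks * disk_size_tb * 1e12 * 8
    # Treat each bit read as an independent Bernoulli trial (a simplification).
    return -math.expm1(bits_to_read * math.log1p(-ure_rate_per_bit))

# Hypothetical 12 x 4 TB RAID5: after one disk dies, 11 disks must be read in full.
print(f"{rebuild_ure_probability(11, 4):.0%}")  # roughly 97% with these assumed numbers
```

With RAID6, a URE (or a full second disk failure) during the rebuild is still covered by the second parity, which is exactly the failure mode that killed your array.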

If they had used e.g. RAID6 instead, a second disk failing during the recovery would not lead to a failed array and the recovery could proceed normally, at the cost of one disk's worth of net storage capacity and a certain performance impact.
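
For concreteness, the capacity cost looks like this on a hypothetical 12 x 4 TB array (made-up numbers, just to illustrate the trade-off):

```python
def usable_tb(disks, disk_size_tb, parity_disks):
    # Net capacity after subtracting parity; ignores filesystem/formatting overhead.
    return (disks - parity_disks) * disk_size_tb

print(usable_tb(12, 4, parity_disks=1))  # RAID5: 44 TB usable, survives 1 disk failure
print(usable_tb(12, 4, parity_disks=2))  # RAID6: 40 TB usable, survives 2 disk failures
```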

I can't see how leaving 15% free space would help at all with this problem; while it might or might not be a good idea from a performance standpoint for the filesystem, it is clearly unrelated to the failing RAID. I call bullshit on that.

All that said, I can't help but wonder: having this happen multiple times over the course of a few months seems like too much even for a RAID5 system. I would suggest looking into the disk types used - it just might be that your vendor used cheap desktop drives instead of 24/7 drives certified for use in such a system.


I fully understand this is an old post, but as I continue to see large RAID5 arrays in production, I would like to add my thoughts here.

  • disks failing too often are generally a sign of overheating and/or excessive vibration, which you find in poorly-engineered enclosures or in bad physical locations;

  • such large RAID5 arrays should be strongly avoided. As a general rule, it is much better to have a RAID6 array than a RAID5 + hotspare one. In the OP's case, rather than having 1x parity disk with 2x global hotspares, it would have been much better to have 2x parity disks in a RAID6 configuration;

  • it is key to have a reliable system for error and status reporting: an unknowingly degraded, unmonitored array is a recipe for disaster (see the sketch below for a minimal check).
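
On the monitoring point: if this happens to be Linux md software RAID (an assumption on my part; a hardware controller would come with its own CLI), even a tiny check like the one below, run from cron, covers the "unknowingly degraded" case. In practice you would rather set up mdadm --monitor with mail alerts, but the point stands that some automated check must exist.

```python
#!/usr/bin/env python3
# Minimal degraded-array check for Linux md software RAID.
# Assumption: arrays are managed by md and visible in /proc/mdstat;
# hardware RAID controllers need their vendor's tooling instead.
import re
import sys

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return md devices whose member status field (e.g. [UU_U]) shows a missing disk."""
    with open(mdstat_path) as f:
        text = f.read()
    degraded = []
    # Each array block starts with "mdN : ..." and contains a [UUUU]-style status field.
    for block in re.split(r"\n(?=md\d+ :)", text):
        name = re.match(r"(md\d+) :", block)
        status = re.search(r"\[([U_]+)\]", block)
        if name and status and "_" in status.group(1):
            degraded.append(name.group(1))
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("DEGRADED arrays:", ", ".join(bad))
        sys.exit(1)  # non-zero exit so cron or a monitoring system can alert on it
    print("all md arrays look healthy")
```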