Linux software RAID runs checkarray on the first Sunday of the month? Why?

It looks like Debian defaults to running checkarray on the first Sunday of the month.

This causes massive performance problems and heavy disk usage for 12 hours on my 2TB mirror. Doing this "just in case" seems bizarre to me. Discovering data out of sync between the two disks, with no quorum to say which copy is correct, would be a failure anyway.

All this massive check could tell me is that I have an unrecoverable drive failure and corrupt data, which is nice to know but not all that helpful. Is it really necessary?

Given I have no disk errors and no reason to believe my disks have failed, why is this check necessary? Should I take it out of my cron?

/etc/cron.d# tail -1 /etc/cron.d/mdadm
57 0 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet
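For what it's worth, that cron job is only a thin wrapper: checkarray essentially writes "check" into the array's sync_action file in sysfs, so you can trigger and watch a check by hand if you want to see what it does. A rough sketch, assuming the array is md0 (substitute your own device name):

# kick off a consistency check manually
echo check > /sys/block/md0/md/sync_action
# watch the progress of the check
cat /proc/mdstat
# count of sectors found to differ between the mirrors (normally 0)
cat /sys/block/md0/md/mismatch_cnt

If you do decide you don't want the monthly run, Debian's mdadm package also reads /etc/default/mdadm; setting AUTOCHECK=false there (or removing the cron entry) should stop it.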

Thanks for any insight,


Solution 1:

Since it sounds like you are running RAID1, I agree that you don't need the check in your situation, but I disagree with some of the reasons given by the first answerer.

1) RAID is an UPTIME/ACCESS SPEED solution, not a backup solution. Having the RAID fail shouldn't mean data loss, because you shouldn't be relying on it as your backup in the first place.

2) I'm curious why you think mirroring the entire drive is "Inefficient." Why add complexity and have to rely on the computer not missing something when you can just mirror everything?

3) "Risky because in case of disk failures, rebuilding a mirror or parity disk for large and active arrays can take days - in this interval if another disk from degraded array fails it means data loss." As opposed to what, keeping everything on a single disk? RAID isn't perfect but it does mean that you can survive an entire drive dying without losing access to your data, and can REBUILD without losing access to your data. Also, on anything OTHER than RAID1, the periodic testing can detect a drive that is becoming bad (it keeps track of individual block failures for a particular drive, and also uses SMART data) and can flag it as failed BEFORE you lose access to the data. Immediate, catastrophic drive failure is not the only data loss scenario.

Solution 2:

Checking a RAID1 still makes sense, and doing it on a regular basis should keep your data significantly safer.

This seems counter-intuitive until you dig deeper into how drives tend to fail. I agree that if the two disks are simply out of sync, the check won't do any good. But what if one of the disks has a sector that has recently failed? Reading from that drive during the check will produce a read error for that drive. This is valuable information for the RAID driver, because it knows from the mirror copy on the second drive what should be stored in the failed sector.

The RAID driver will therefore try to rewrite the failed sector (even in check mode, which inexperienced users assume is read-only). The rewrite may or may not succeed, but modern disks all have spare sectors that replace a failed sector upon a write (not upon a read; a read just reports a read failure). So, by a combination of the RAID driver rewriting the sector and the hard drive reallocating a spare sector for the failed one, the array is being fixed on the fly. The RAID driver does not know (and does not need to know) that a reallocation occurred. That happens inside the drive itself, and if properly configured (see smartctl), the operating system can email the admin to say that a sector was reallocated, meaning it's time to replace the slowly failing disk.

Modern large disks have a tendency to produce these "pending sector" read errors, for example due to temperature fluctuations. Using them in a RAID array significantly improves reliability, and running regular checks ensures that questionable sectors are automatically refreshed when they have read problems. The refresh write may even succeed on the original sector, in which case the "pending sector" never becomes a "reallocated sector"; in other words, the disk isn't really bad at all, because it is able to write the sector now.
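To get that email, smartd (from the smartmontools package) can watch the SMART attributes and mail you when they change. A minimal /etc/smartd.conf sketch, assuming the mirror members are /dev/sda and /dev/sdb and that local mail to root actually gets delivered (both are assumptions; adjust for your setup):

# monitor all attributes (-a), mail root on problems (-m),
# and send one test mail at startup to confirm delivery (-M test)
/dev/sda -a -m root -M test
/dev/sdb -a -m root -M test

After editing the file, restart the smartd/smartmontools service so it picks up the changes.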

Anyway, by using smartctl and doing regular checks, you will keep your RAID array a lot more reliable. The comments about ZFS (e.g. RAID-Z3) are important as well. As drive sizes increase, as data is written more densely, and as drive engineering is driven more by consumer markets than server markets, the overall risk of data loss per unit of storage rises dramatically. Running RAID5 on drives several TB in size is risky, due to the long rebuild times and the need to read extensively from all the other drives during a rebuild. Consider RAID6 with two spares if you want to protect yourself against catastrophic data-loss failures (yes, it's just a probability, and you need to weigh other factors as well, such as controller failures and power outages; there is a lot to do to balance the reliability of many different components). And even RAID6 may be hundreds of times less reliable than having three parity drives, as in RAID-Z3.
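For comparison, the triple-parity layout mentioned above looks like this in ZFS. This is only a hypothetical sketch (the pool name "tank" and the disk names are placeholders, and RAID-Z3 needs a reasonably wide set of disks to make sense):

# create a pool with triple parity (RAID-Z3) across six disks
zpool create tank raidz3 sda sdb sdc sdd sde sdf
# a scrub plays the same role as mdadm's checkarray
zpool scrub tank
zpool status tank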