How to get notified of mdadm RAID problems?

Solution 1:

What could cause the disks to suddenly become out of sync?

It could be any hardware or software fault in the path between the drive platters and the data in memory. That includes, but is not limited to: the drive head, the drive controller, the connector on the cable, the cable itself (an internal wire break), the port the cable plugs into on the drive, the port on the motherboard or daughter-card, the controller chip on the motherboard or daughter-card, or even a failure in software (somewhere).

True story: I once had a RAID mirror that was flaky, dropping a drive for no apparent reason. The drives checked out fine, the platters were clean (repeated SMART passes turned up nothing), and everything worked well - until it flaked out again, and again. I replaced the $3 SATA cable and the issues instantly went away. Moral of the story: there's a LOT that can go wrong, and you can't assume that "everything is fine" unless you check every component in the path of the data.

Why was I not notified by email?

Email notification only occurs when (a) something is actively monitoring the array, or (b) the array is explicitly interrogated, for example by a periodic cron check.

My advice: have mdadm actively monitor the drive array as a long-running process. This can be accomplished with something similar to (but not necessarily identical to):

    mdadm --monitor --scan --syslog

You will need to adjust the above line to your specific installation.
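
To actually receive email, pair the monitor with a MAILADDR line in the mdadm configuration file. A minimal sketch, assuming a Debian/Ubuntu layout (the address is a placeholder; --test sends a one-time test alert per array so you can verify delivery):

    # /etc/mdadm/mdadm.conf (RHEL-style systems use /etc/mdadm.conf)
    MAILADDR admin@example.com

    # Run the monitor as a daemon, polling every 60 seconds, and
    # send a test alert for each array at startup:
    mdadm --monitor --scan --daemonise --delay=60 --test

On many distributions the mdadm package already ships a service (often named mdmonitor) that runs this for you; the manual invocation is mainly useful for verifying that alerts actually arrive.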

Why was the error not properly logged to syslog before halting the system? Could it be that the system tried to log to syslog, but did so after stopping the syslog daemon? If so what can I do to prevent that?

There could have been a variety of issues that caused the logging to be dropped.

First, there is the entire issue of how syslog works in general; while many years have gone into making it robust and reliable, there are certain edge cases where data may not make it to disk. This is a well-known design issue, and one that was actively addressed by supervision-style service management (daemontools and its ilk). The solution there was to bypass syslog altogether and write the output to a logger that holds an open file descriptor at all times, so nothing gets dropped, and the logger flushes the output to disk as fast as possible. While this is not a 100% effective solution, it significantly improves the odds of having events written to the drive before the kernel panics or the machine shuts down.
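
As a concrete illustration of that supervision-style pattern, here is a minimal runit-style sketch (the service name and paths are hypothetical). The logger holds a pipe to the service's stdout/stderr open for the service's whole lifetime, so output never passes through syslogd at all:

    # /etc/sv/myservice/run -- supervise the service itself
    #!/bin/sh
    exec 2>&1                      # merge stderr into stdout
    exec /usr/local/bin/myservice  # replace the shell with the service

    # /etc/sv/myservice/log/run -- the dedicated logger
    #!/bin/sh
    # svlogd keeps its file descriptor open at all times and writes
    # to an auto-rotated directory, timestamped in UTC (-tt):
    exec svlogd -tt /var/log/myservice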

Second, there is the possibility that the kernel had an outright panic, or some other event occurred that forced the machine into a corner. Even faulty hardware can cause this - I've seen machines with underpowered PSUs cause spontaneous shutdowns in Windows 8, and replacing the PSU fixed the shutdown problem permanently. Obviously, nothing the kernel can do will guard against a machine that just decides "I've had enough of this" and toddles off to reboot-land.

What can I do to find out what happened? Or, if there's no way for me now to find out what happened, how can I improve logging and notifications so that next time I can do a better post-mortem?

There are several approaches:

  • Place logging on a separate partition. This is no guarantee that you will get intact logs, but it does isolate the log volume from filesystem issues elsewhere, such as disk-full-can't-write or corruption that forces a remount to read-only.

  • Look at remote logging of vital system information. Again, this is not a guarantee, but it helps if the last packet can "make it out the door" before a reboot happens and that packet holds critical clues to why the reboot happened. (A minimal forwarding config is sketched after this list.)

  • For specific, critical services, look at replacing output to syslog with something else, such as the supervision-style logging sketched above, where a dedicated logger intercepts output and writes it to disk as soon as possible. This increases the odds of the output making it to storage, and with a little work it can co-exist side-by-side with other service-management arrangements.
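
For the remote-logging item above, here is a minimal rsyslog forwarding sketch (the hostname loghost is a placeholder; a single @ forwards over UDP, @@ over TCP):

    # /etc/rsyslog.d/50-remote.conf on the machine being monitored
    # Forward everything; kernel messages are usually the critical
    # clue in a crash post-mortem.
    *.*    @@loghost:514

UDP (@) is connectionless fire-and-forget; TCP (@@) is more reliable in normal operation.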

Solution 2:

What could cause the disks to suddenly become out of sync?

Drive failure, controller failure, some other hardware failure, or some obscure software issue.

Why was I not notified by email?

Ubuntu has a cronjob, /etc/cron.d/mdadm, that results in the RAID volumes being checked once a day at 00:57. If your system wasn't having problems at that moment, or had already gone down by then, there was no way for it to send a message.
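
You don't have to wait for the cronjob; the array can be interrogated by hand at any time (the device name /dev/md0 is a placeholder for your array):

    # Quick overview of every array and its sync state:
    cat /proc/mdstat

    # Detailed status for one array, including failed or removed members:
    mdadm --detail /dev/md0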

Why was the error not properly logged to syslog before halting the system?

Well, if drives are failing, it doesn't really make sense to try to write to them, since any further writes could trash whatever is left. Not knowing the exact nature of your failure, it could be that your volume or filesystem went read-only. By default, Ubuntu is set up to switch the root filesystem to read-only if there are errors on the root volume.
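
That default comes from the errors=remount-ro mount option on the root filesystem. A sketch of where it lives and how to inspect it (the UUID and device name are placeholders):

    # /etc/fstab -- Ubuntu's stock root entry carries the option:
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  ext4  errors=remount-ro  0  1

    # The ext filesystem also records its own error policy in the
    # superblock; inspect or change it with tune2fs:
    tune2fs -l /dev/sda1 | grep -i 'errors behavior'
    tune2fs -e remount-ro /dev/sda1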

How can I improve logging and notifications so that next time I can do a better post-mortem?

Set up logging to a remote syslog host. That way, a local storage failure doesn't mean that nothing can be logged.
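
The client-side forwarding line was sketched under Solution 1; the log host itself must also be configured to listen. A minimal rsyslog sketch for the receiving side (514 is the conventional syslog port):

    # /etc/rsyslog.conf on the central log host
    module(load="imudp")             # accept syslog over UDP
    input(type="imudp" port="514")

    module(load="imtcp")             # accept syslog over TCP
    input(type="imtcp" port="514")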