BBWC: in theory a good idea but has one ever saved your data?

Sure. I've had battery-backed write cache (BBWC) and, later, flash-backed write cache (FBWC) protect in-flight data following crashes and sudden power loss.

On HP ProLiant servers, the typical message is:

POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator

Which means, "Hey, there's data in the write cache that survived the reboot/power-loss!! I'm going to write that back to disk now!!"

An interesting case was my post-mortem of a system that lost power during a tornado. The sequence of array messages was:

POST Error: 1793-Drive Array - Array Accelerator Battery Depleted - Data Loss
POST Error: 1779-Drive Array Controller Detects Replacement Drives
POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator

The 1793 POST error is the unusual one. While the system was in use, power was interrupted with data still in the Array Accelerator memory. Because of the tornado, power was not restored for four days, so the array batteries were depleted and the cached data was lost. The server had two RAID controllers. The other controller had an FBWC unit, which retains data far longer than a battery can. That array recovered properly. Some data corruption resulted on the array backed by the depleted battery.

Despite plenty of battery runtime at the facility, four days without power and hazardous conditions made it impossible for anyone to shut the servers down safely.

Yes, had that case.

Server "without UPS" in a data center (with the data center having a UPS). PDU failure - system crashed hard. No data loss.

And that basically is it. The good thing about a BBWC is that it is inside the machine. Have a UPS as well, but believe me, sometimes someone does something stupid (like pulling the wrong cable). A UPS is external, so it cannot help in that case. Oh, THAT cable ;)

I've had 2 cases where the battery-backed cache in HW RAID controllers failed completely (at 2 separate companies).

BBWC relies on the unsurprising idea that the battery works. The catch is that at some point the battery in the controller fails, and what's devastating is that in many HW RAID controllers it fails silently. We thought we had a cache protected against power loss, but we did not.

When power was lost, the damage to the RAID array was so extensive that all disk contents were unrecoverable. Everything was lost. One of the cases involved a machine dedicated entirely to testing, but still.

After that I said "never again", switched to software-based disk mirroring (mdadm) on Linux plus a journaling filesystem with decent resilience against power loss (ext4), and never looked back. Granted, I've only used it on servers that did not have extremely high IO usage.
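Since the lesson above was about silent failures, here is a minimal sketch of how a degraded mdadm mirror could be spotted from the text of /proc/mdstat. It assumes the usual layout where each array's status line ends in a member map such as [UU] (healthy) or [U_] (a member missing). The function name `degraded_arrays` and the sample text are my own illustration, not output from any of the machines described here.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Assumed format: the status line ends with a map such as [UU] or [U_]
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded

# Hypothetical sample in the style of /proc/mdstat, for illustration only
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      976630336 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real machine you would pass in the contents of /proc/mdstat instead of the sample and alert whenever the returned list is non-empty.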
