Non-volatile cache RAID controllers: what kind of protection is there against NVCACHE failure?
The battery backup unit (BBU) model:
- admin enables write-back cache with BBU
- writes are cached to the RAID controller's RAM (major performance benefit)
- the battery saves uncommitted and cached data in the event of a power loss (reliability)
If I lose power and come back within a day or so, my data should be both complete and uncorrupted.
The downside is that if the battery is dead or low, OR EVEN IF IT IS IN A RELEARN CYCLE (drain/charge loops that verify the battery's health), the controller reverts to write-through mode and performance suffers. What's more, relearn cycles are usually automated on a schedule that may well fire in the middle of heavy traffic. So, if that's a concern, they have to be manually disabled and rescheduled for off-hours. Annoying either way.
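For reference, here's the sort of thing I use to see whether a learn cycle is active or pending before the cache quietly drops to write-through. This is a minimal sketch for LSI-based controllers (which includes Dell PERCs), assuming MegaCli64 lives at the usual path and that the status report uses field names like "Battery State" and "Learn Cycle Active" - both of those vary by firmware, so verify them on your box:

```python
#!/usr/bin/env python3
"""Rough check of PERC/LSI battery and relearn state by parsing MegaCli output.

Assumptions to verify for your controller/firmware:
  - MegaCli64 (or perccli with equivalent syntax) is installed at this path
  - the BBU report contains lines like "Battery State: Optimal" and
    "Learn Cycle Active : No" -- field names differ between firmware releases
"""
import subprocess

MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"  # adjust to your install path


def bbu_status(adapter="ALL"):
    """Return the BBU status report as a dict of 'Field': 'value' pairs."""
    out = subprocess.run(
        [MEGACLI, "-AdpBbuCmd", "-GetBbuStatus", f"-a{adapter}"],
        capture_output=True, text=True, check=True,
    ).stdout
    fields = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    status = bbu_status()
    print("Battery state :", status.get("Battery State", "unknown"))
    print("Learn active  :", status.get("Learn Cycle Active", "unknown"))
```

If it reports a learn cycle active or pending, you at least know why the cache went write-through, and you can kick the learn off manually in a quiet window (MegaCli's -AdpBbuCmd -BbuLearn does that) instead of letting the automatic schedule pick the time.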
NV caches have capacitors with a sufficient charge to commit any uncommitted-to-disk data to flash. Not only is that more survivable in longer power outages, but you don't have to concern yourself with battery death, wear-out, or relearning.
All of that sounds great to me. What doesn't sound great to me is the prospect of that flash module having an issue, though. What if it's completely hosed? What if it's only partially hosed? A bit corrupted at the edges? Relearn cycles can tell when something like a simple battery is failing, but is there a similar process to verify that the flash is functional? I'm just far more trusting of a battery, warts and all.
I know the card's RAM can fail, the card itself can fail - that's common territory, though.
In case you didn't guess, yeah, I've experienced a shocking-to-me amount of flash/SSD/etc. failure :)
Solution 1:
You're over-thinking this.
Of course, this depends slightly on the manufacturer's specific implementation, but having deployed thousands of HP ProLiant servers over 10 years, I've experienced hundreds of RAID controller battery failures. I replaced the bad units, knowing that sudden power-loss or a system crash would result in some level of data corruption if I didn't have a healthy battery in place.
I was happy to see the move to flash-backed write caches in recent years. The flash units on HP ProLiant systems are a separate super-capacitor that attaches to the controller RAM module. I suppose they can fail. I've not experienced one yet. The HP Smart Array RAID controllers can be set to keep write caching enabled regardless of the battery/supercap health. This presumes facility protection against sudden power-loss. You still have to worry about application stability and system crashes.
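As an aside, that "keep write caching enabled regardless of battery/supercap health" behaviour is settable from the CLI as well. A minimal sketch, assuming ssacli (hpssacli/hpacucli on older stacks) is installed and that the option keyword is `nobatterywritecache`, which is how I remember it being exposed - verify both before relying on this, because with it enabled a power loss means losing whatever was sitting in cache:

```python
#!/usr/bin/env python3
"""Keep HP Smart Array write caching enabled regardless of battery/supercap health.

Assumptions to verify: ssacli is installed and your controller/firmware accepts
the `nobatterywritecache` keyword. Only sensible when facility power is solid.
"""
import subprocess

SLOT = "0"  # controller slot; list controllers with `ssacli ctrl all show`

# Show the current controller settings first so you can confirm the field name.
subprocess.run(["ssacli", "ctrl", f"slot={SLOT}", "show", "detail"], check=True)

# Enable "no-battery write cache": write-back stays on even if the
# battery/supercap is failed, missing, or still charging.
subprocess.run(
    ["ssacli", "ctrl", f"slot={SLOT}", "modify", "nobatterywritecache=enable"],
    check=True,
)
```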
It sounds as though you're referring to Dell PERC controllers and their NVCACHE implementation. It's a similar design. Dell explains in their guidebook...
4.5.1 Non-Volatile Cache
Dell PERC controllers with non-volatile (NV) cache use the standard battery as contained in the Dell PERC controllers with a battery back-up unit (BBU). The difference is in battery implementation:
- The battery in the BBU offering retains the data in cache in the event of a power cycle for a guaranteed period of 24 hours (typically up to 72 hours).
- The battery in the NV cache offering will transfer the data from cache to flash in the event of a power cycle, where the data will be retained for up to ten years.
Think about your application and your storage access patterns. Are you really writing to the array fast enough and with an amount of data that cannot be flushed to disk effectively? Is your application unable to recover from a crash or sudden reboot?
If you're really concerned about application availability, focus on protecting facility power (healthy UPS + generator) and bolstering your systems with redundant components (power supplies, fans, etc.).
Edit:
I'm looking at an HP Smart Array P410 RAID controller with a flash-backed write cache onboard. There are health LEDs for the flash module and the older external battery (HP equipment doesn't do the relearn cycle).
A dedicated battery microcontroller continuously monitors the HP Smart Array battery pack for signs of damage, including an open battery terminal, partial battery short, charge timeouts, and over discharge conditions.
The super-capacitor's health is monitored as well; its warning LED is located on the flash module. The RAM is ECC-protected, which adds another layer of defense. Both conditions are reported to the host server via SNMP traps and can be viewed through the diagnostic utilities.
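If you want to keep an eye on this from outside the host's own agents, the same health data is reachable over SNMP. A minimal polling sketch using net-snmp's snmpget; the OID is my recollection of cpqDaAccelStatus from HP's CPQIDA-MIB, and the host/community values are placeholders, so check all of them against the actual MIB before trusting the output:

```python
#!/usr/bin/env python3
"""Poll an HP Smart Array cache module (accelerator) status over SNMP.

Assumptions to verify: net-snmp's `snmpget` is installed, the target runs the
HP/HPE management agents, and the OID below really is cpqDaAccelStatus in
CPQIDA-MIB -- it is quoted from memory here.
"""
import subprocess

HOST = "server.example.com"   # placeholder
COMMUNITY = "public"          # placeholder
# cpqDaAccelStatus, first instance -- confirm against CPQIDA-MIB.
ACCEL_STATUS_OID = "1.3.6.1.4.1.232.3.2.2.2.1.9.0"


def accel_status():
    """Return the raw accelerator status value as a string."""
    return subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, ACCEL_STATUS_OID],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


if __name__ == "__main__":
    # Map the returned integer against the MIB (enabled vs. the various
    # degraded states) rather than trusting a hard-coded value here.
    print("cpqDaAccelStatus =", accel_status())
```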
From HP's Smart Array technology guide.
The Super-cap sub-assembly consists of two 35-Farad 2.7V capacitors, configured in series, providing 17 Farads at up to 5.4V. The charger maintains the Super-cap at 4.8V, providing the required amount of power to complete backup operations while extending the life of the Super-cap. The charger monitors Super-cap health and activates LED indicators on the FBWC module to warn of impending failure. The Super-Cap module uses the same form factor and housing as the HP 650 mAh P-Series battery used in the HP BBWC.
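A quick back-of-the-envelope check on those numbers (my arithmetic, not HP's): two 35 F capacitors in series come out to roughly the 17 F the guide quotes, and at the 4.8 V float voltage that is on the order of 200 J of stored energy:

\[
C_{\text{series}} = \frac{35 \times 35}{35 + 35}\ \mathrm{F} = 17.5\ \mathrm{F},
\qquad
E = \tfrac{1}{2} C V^{2} \approx \tfrac{1}{2} \times 17.5 \times 4.8^{2} \approx 200\ \mathrm{J}.
\]

Even allowing for converter losses, that should be a comfortable budget for one pass of copying the cache DRAM out to flash, which is why the supercap approach doesn't need anything like a battery's capacity or hold-up time.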
My point with all of this is that the manufacturers have engineered solutions to make the flash cache solution work and become a viable replacement for the older battery-based technology. It's in their interest to provide proper monitoring facilities.
As a note, check the visual indicators on HP's newest-generation flash modules. You can be sure that all of these checks are integrated into the controller's alerting and diagnostics system.
Solution 2:
Presumably, the server itself would fail to boot if the RAID controller's BIOS encountered a failure during its power-on tests. It would check the onboard memory just as the main server BIOS checks its own memory. If you want details on this, your best bet is to call the manufacturer of your RAID card.