Hardware RAID controller cache battery failure frequency/lifetime?
I'm in an environment that contains many Supermicro servers equipped with Adaptec and LSI MegaRAID hardware RAID controllers. These controllers contain battery-backed cache modules to help boost write performance and protect data in-transit.
A frequent support issue is RAID controller battery failure. This shifts the array from write-back to write-through mode, so the system runs with degraded write speed. This persists until a downtime window can be established to power the system down and replace the battery.
This is a very routine operation for us, almost weekly across several thousand physical servers. We even have charging stations in place to prep replacement batteries so that they can be swapped in without a charge cycle.
Perhaps I'm spoiled by a long history with HP ProLiant servers and Smart Array RAID controllers, but HP systems typically had battery lifetimes of 4-6 years. HP eventually eliminated RAID batteries around 2009, replacing them with supercapacitor-backed memory modules (flash-backed write cache, or FBWC) that don't require replacement, disposal or a lengthy initial charge cycle.
Since I see the Adaptec and LSI controller battery failures sometimes occurring on systems that have been in service for less than 12 months, I wonder if this is common in other environments.
If this is common, how do other large server environments handle this?
- Any tips or tricks to handling RAID battery replacements?
- Are there any configuration parameters that can help?
- How disruptive is this to operations in your environment?
- Could poor chassis cooling and temperature be a factor?
- Are we doing something wrong?
- Dell PERC controllers are made by LSI. Do Dell environments experience the same short battery lifetimes?
LSI product literature outlining a new-generation battery that can last longer in service than 1 year.
HP ProLiant DL585 G2 server with 1000+ day uptime and a happy RAID battery...
# uptime
05:38:08 up 1031 days, 44 min, 31 users, load average: 0.49, 0.64, 0.99
# hpacucli
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Total Cache Size: 512 MB
Battery Pack Count: 1
Battery Status: OK
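For checking output like this across many hosts, here is a minimal Python sketch. It assumes the status text has already been captured as "Key: Value" lines like the ones above. The function names and the "Failed" value used in the demo are hypothetical and only for illustration; real tools may word a bad state differently.

```python
def parse_status(text):
    """Parse 'Key: Value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons or slashes.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def cache_problems(fields, watched=("Cache Status", "Battery Status")):
    """Return the watched fields whose value is missing or anything but 'OK'."""
    return {k: fields.get(k) for k in watched if fields.get(k) != "OK"}


# Sample text copied from the output shown above.
SAMPLE = """\
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Total Cache Size: 512 MB
Battery Pack Count: 1
Battery Status: OK
"""

if __name__ == "__main__":
    healthy = parse_status(SAMPLE)
    print(cache_problems(healthy))

    # "Failed" is a made-up value, used only to show a non-OK state.
    degraded = parse_status(SAMPLE.replace("Battery Status: OK", "Battery Status: Failed"))
    print(cache_problems(degraded))
```

An empty result means both watched fields read OK. Anything else is a candidate for a replacement ticket before the array sits in write-through mode for long.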
I suspect your Supermicros are broken one way or another - possibly the battery packs are overheating. Most recent LSI controllers report the BBU temperature through MegaCli - you might want to monitor this value on the servers that needed replacements.
root@host:~/SOLARIS# ./MegaCli -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: BBU
[...]
Temperature: 41 C
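To track that value over time, something like the following Python sketch could parse the captured output. It relies only on the two labels visible in the sample above ("BBU status for Adapter:" and "Temperature: NN C"); other firmware versions may print them differently. The function names and the 45 C default limit are assumptions for illustration, not a vendor threshold.

```python
import re

# Labels are taken from the sample output above; treat them as assumptions
# for any other MegaCli or firmware version.
ADAPTER_RE = re.compile(r"^BBU status for Adapter:\s*(\d+)")
TEMP_RE = re.compile(r"^Temperature:\s*(-?\d+)\s*C\b")


def bbu_temperatures(text):
    """Return {adapter_number: temperature_in_celsius} from GetBbuStatus text."""
    temps = {}
    adapter = None
    for raw in text.splitlines():
        line = raw.strip()
        m = ADAPTER_RE.match(line)
        if m:
            adapter = int(m.group(1))
            continue
        m = TEMP_RE.match(line)
        # Keep only the first numeric temperature seen for each adapter.
        if m and adapter is not None and adapter not in temps:
            temps[adapter] = int(m.group(1))
    return temps


def hot_adapters(temps, limit_c=45):
    """List adapters whose BBU temperature is above limit_c (hypothetical limit)."""
    return sorted(a for a, t in temps.items() if t > limit_c)


SAMPLE = """\
BBU status for Adapter: 0
BatteryType: BBU
Temperature: 41 C
"""

if __name__ == "__main__":
    temps = bbu_temperatures(SAMPLE)
    print(temps)
    print(hot_adapters(temps, limit_c=40))
```

Feeding it the saved output of the command above from each server, and logging the result, would show whether the hosts that eat batteries also run their BBUs hotter than the rest.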
I have seen a couple of Dell and Fujitsu systems with LSI BBU controllers, and none of them needed yearly battery pack replacement (unless the pack was ruined by a deep discharge). The typical lifetime has been around 3 to 5 years.
Average battery life should be 3-5 years. And don't forget that flash-based FBWC modules fail too. I don't know why or how, but we were replacing them fairly regularly on our HP servers. They should last longer than a battery, but I don't have statistics from our individual servers.
The standard way to avoid the effects of a failed battery or a battery learning cycle is to have multiple batteries. This is how HP storage (such as the HP EVA) does it: there are 2 hot-plug batteries, and while one is low on charge or being replaced, the controller works with the remaining one. I'm not sure whether it is possible to connect multiple batteries to a Smart Array controller, but the hpacucli diag output suggests it should be supported:
Battery 1 firmware is up to date.
Battery 2 not present.
Battery 3 not present.
Battery Status:    Battery 1        Battery 2   Battery 3
---------------    ---------        ---------   ---------
Present:           YES              NO          NO
Responding:        YES              N/A         N/A
PIC Revision:      52               .           .
Status:            0x80             .           .
Extra Status:      0x01             .           .
Enabled:           FALSE            .           .
Charging:          FALSE            .           .
Good:              TRUE             .           .
Open:              FALSE            .           .
Shorted:           FALSE            .           .
Sample Err:        FALSE            .           .
Control:           0x00             .           .
Load Current:      (0x70) 24.6mA    .           .
Per Memory Chip:   4920uA           .           .
Voltage:           (0xae) 5640mV    .           .
Capacity:          100%             .           .
Depletion count:   0x00             .           .
My experience with the IBM versions of the LSI platforms, over a few hundred installs, is that the average battery barely makes 2 years, and the supercap isn't any better. Some of the failures can be fixed with a firmware update, but LSI just hasn't got it right. I have seen about 75% of the supercaps fail within the first 2 years.