HP ProCurve 5412zl warm-boots on power failure while attached to UPS
My client's HP ProCurve 5412zl chassis switch reboots on occasion, despite being powered through four redundant power supplies and being under UPS protection.
These reboots usually happen during a real power outage or during a brown-out or low-voltage event. All of the equipment attached to the UPS stays up except for the switch.
The UPS for the rack is an APC SmartUPS SUA3000XL 208V with step-down transformer. This switch provides PoE for phones and access points throughout the facility. The battery cells are healthy, replaced recently and have a full charge.
These blips have the effect of rebooting all of the phones in the facility and disconnecting users from their sessions. It's disruptive.
In the switch logs:
Keys: W=Warning I=Information
M=Major D=Debug E=Error
---- Event Log listing: Events Since Boot ----
I 02/17/16 22:26:31 03802 chassis: System Self test started on Master
I 02/17/16 22:26:31 03803 chassis: System Self test completed on Master
I 02/17/16 22:26:35 00061 system: -----------------------------------------
I 02/17/16 22:26:35 00062 system: Mgmt Module 1 went down without saving crash information
M 02/17/16 22:26:35 03001 system: System reboot due to Power Failure
And version information:
valley-core# sh version
Image stamp: /ws/swbuildm/rel_orlando_qaoff/code/build/btm(swbuildm_rel_orlando_qaoff_rel_orlando)
Nov 19 2014 15:17:26
K.15.16.0005
335
Boot Image: Secondary
For years, I didn't realize that you have to modify the power supply settings on this switch model, but this unit is configured properly to take advantage of the multiple PSUs.
valley-core# sh power-over-ethernet
Status and Counters - System Power Status
System Power Status : Full redundancy
PoE Power Status : Full redundancy
Chassis power-over-ethernet:
Total Available Power : 600 W
Total Failover Power : 600 W
Total Redundancy Power : 600 W
Total Used Power : 359 W +/- 6W
Total Remaining Power : 241 W
Internal Power
Main Power
PS (Watts) Status
----- ------------- ---------------------
1 300 POE+ Connected
2 300 POE+ Connected
3 300 POE+ Connected
4 300 POE+ Connected
External Power
EPS1 /Not Connected.
EPS2 /Not Connected.
Additional PSU information:
valley-core# sh system power-consumption
Slot Power Usage:
Slot Module Description Current Power
----- ----------------------------------------- ---------------
A HP J9534A 24p Gig-T PoE+ v2 zl Module 18 W
B HP J9536A 20p GT PoE+/2p SFP+ v2 zl Mod 23 W
C HP J9534A 24p Gig-T PoE+ v2 zl Module 18 W
D HP J9534A 24p Gig-T PoE+ v2 zl Module 19 W
E HP J9534A 24p Gig-T PoE+ v2 zl Module 17 W
F HP J9534A 24p Gig-T PoE+ v2 zl Module 18 W
G HP J9534A 24p Gig-T PoE+ v2 zl Module 18 W
H HP J9534A 24p Gig-T PoE+ v2 zl Module 18 W
K HP J9534A 24p Gig-T PoE+ v2 zl Module 18 W
L HP J9534A 24p Gig-T PoE+ v2 zl Module 19 W
valley-core# sh system power-supply
Power Supply Status:
PS# Model State AC/DC + V Wattage
---- --------- ------------- ----------------- ---------
1 Unknwn Powered AC 120V 875
2 Unknwn Powered AC 120V 875
3 Unknwn Powered AC 120V 875
4 Unknwn Powered AC 120V 875
4 / 4 supply bays delivering power.
Total power: 3500 W
What's unique is that the switch is the only device losing power. None of the connected servers have power issues, despite being on the same battery or PDU.
I can admit that the power in this location is poor and suffers from voltage dips and the occasional spike. But the UPS didn't even log a fault during this recent warm-boot.
I have another 5412zl at an unrelated customer that has done the same thing multiple times in the past.
Any thoughts on what I can do about this? Should I try to move two of the PSUs to utility power instead of all being on the UPS?
Edit:
Boot history shows:
valley-core# sh boot-history
Mgmt Module 1 -- Saved Crash Information (most recent first):
=============================================================
ID: 29008d6a
Active system went down: 02/01/16 09:23:54 K.15.16.0005 335
Switch rebooting due to temporary loss of power or low voltage
ID: 994a405a
Active system went down: 12/14/15 11:31:15 K.15.16.0005 335
switch rebooting due to temporary loss of power or low voltage
An HP change note on a previous firmware revision says:
Power (CR_0000112424) - When the switch is exposed to AC power fluctuations and the voltage drops too low, the switch reboots and generates an incorrect error message saying the switch crashed. With this fix, the error message is changed to "Switch rebooting due to temporary loss of power or low voltage".
This is consistent with this tech note.
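To compare these reboots against whatever the UPS or utility logged, the boot-history text can be turned into timestamps. This is a minimal sketch that assumes the output looks exactly like the two entries pasted above; the function names `parse_boot_history` and `is_power_event` are hypothetical helpers, not part of any HP tooling.

```python
import re
from datetime import datetime

# Sample text copied from the "sh boot-history" output above.
SAMPLE = """\
ID: 29008d6a
Active system went down: 02/01/16 09:23:54 K.15.16.0005 335
Switch rebooting due to temporary loss of power or low voltage
ID: 994a405a
Active system went down: 12/14/15 11:31:15 K.15.16.0005 335
switch rebooting due to temporary loss of power or low voltage
"""

# Assumption: timestamps are always MM/DD/YY HH:MM:SS, as in the sample.
DOWN_RE = re.compile(
    r"Active system went down:\s+(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"
)


def parse_boot_history(text):
    """Return (timestamp, reason) pairs in the order they appear."""
    events = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = DOWN_RE.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            # Assumption: the reason is on the line right after the timestamp.
            reason = lines[i + 1].strip() if i + 1 < len(lines) else ""
            events.append((when, reason))
    return events


def is_power_event(reason):
    """True if the reason matches the message from the HP change note."""
    return "loss of power or low voltage" in reason.lower()


if __name__ == "__main__":
    for when, reason in parse_boot_history(SAMPLE):
        tag = "POWER" if is_power_event(reason) else "OTHER"
        print(when.isoformat(), tag, reason)
```

The resulting timestamps can then be checked by hand against the UPS event log or the building's outage records.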
Solution 1:
My immediate thoughts are along the lines of what you're contemplating. First, check whether the blips coincide with any self-test schedule you have set up on the UPS; if some of them happen during a self-test, you have a UPS, transformer, or load problem. If they occur independently of the self-tests, I'd do exactly what you're suggesting: move a couple of the PSUs to a different feed and see if the blips recur. If they do (and I'm not suggesting this lightly), open a case with HP. It may be a painful, tedious process, but they can likely provide guidance on getting real debugging info out of the switch. I'd also take a moment to check the release notes and bug lists for the current firmware revision on the switch.
Solution 2:
According to this page, your UPS series is of the "line interactive" type. This designation means that it isn't constantly converting the utility power to DC and back to mains level again. Rather, it's just sitting there monitoring the power and keeping its batteries charged. Input power is passed straight through, although it may be passed through a few chokes and a surge protection device along the way for extra safety.
When the utility power goes down or has a voltage dip, the UPS needs to switch its inverter into the circuit to start supplying battery power to the connected equipment. Regardless of how this switching is done (it's going to be either a physical or a solid-state relay), you will always see a "gap" of a few milliseconds. Also, the UPS's inverter probably won't be in phase with the utility power, so the AC waveform jumps to the new phase.
Most equipment doesn't really care if the incoming power is lost for a few milliseconds. The capacitors in the power supply are often large enough to ride over small gaps without a problem. I've seen many servers and network equipment take a couple of complete missed cycles without so much as a glitch.
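As a rough illustration of that ride-through, the hold-up time of a supply can be estimated from the energy stored in its bulk capacitor: t = C (V_start² − V_min²) / (2 P). The sketch below uses that formula; every number in it (capacitance, bus voltages, load, transfer gap) is an assumed placeholder, not a measured or published value for the 5412zl PSUs or the SmartUPS.

```python
def holdup_time_s(capacitance_f, v_start, v_min, load_w):
    """Seconds for a bulk capacitor to sag from v_start to v_min at constant load.

    Uses the stored-energy difference 0.5 * C * (v_start^2 - v_min^2)
    divided by the load power. Converter losses are ignored.
    """
    if v_min >= v_start:
        raise ValueError("v_min must be below v_start")
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    usable_energy_j = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_energy_j / load_w


def rides_through(holdup_s, gap_s):
    """True if the supply can bridge a power gap of gap_s seconds."""
    return holdup_s > gap_s


if __name__ == "__main__":
    # All values below are assumptions chosen only to show the arithmetic.
    t = holdup_time_s(capacitance_f=680e-6, v_start=390.0, v_min=300.0,
                      load_w=500.0)
    assumed_gap_s = 0.010  # hypothetical UPS transfer gap
    print(f"hold-up: {t * 1000:.1f} ms, "
          f"survives gap: {rides_through(t, assumed_gap_s)}")
```

A supply with a smaller capacitor, a higher load, or a tighter undervoltage threshold has less margin, which is one way the same gap could drop one device and not another.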
My suspicion is that this particular switch's PSUs are a bit more sensitive to these gaps than most. I'd think your problem could be solved by running the switch off another UPS, one that is continuously in the loop converting AC-DC-AC. This type of UPS is often referred to as "online", although you should check with your vendor to confirm you're getting the right type.
Solution 3:
With the info you just added in the edit it is pretty clear.
2 possible causes come to mind:
1)
When the UPS actually has to do the work, it slightly drops its output voltage, and the rate of change is steep enough to make the switch think it has a low-power condition.
I have seen that happen with UPS units before.
The only remedy is to take some load off the UPS or get a bigger UPS.
In some cases, if the UPS has multiple outgoing circuits, redistributing the load across them may help. Ideally, each circuit should carry more or less the same load. This minimizes voltage drop on the outputs.
2)
Another possibility, though quite rare, also applies to UPS units with multiple outputs. The outputs may not be exactly in sync with respect to the phase of the AC they provide.
If the PSUs of your switch are connected to several circuits with a phase difference, the power board inside the switch that combines the power from its PSUs may have trouble synchronizing, causing the same problem.
In that case the solution is exactly the opposite: put everything on the same circuit.