There's a low probability of complete chassis failure...

You'll likely encounter issues in your facility before sustaining a full failure of a blade enclosure.

My experience is primarily with HP C7000 and HP C3000 blade enclosures. I've also managed Dell and Supermicro blade solutions. Vendor matters a bit. But in summary, the HP gear has been stellar, Dell has been fine, and Supermicro was lacking in quality and resiliency and was just poorly designed. I've never encountered a full chassis failure on the HP and Dell side, but the Supermicro did have serious outages, forcing us to abandon the platform.

  • I've had thermal events. The air-conditioning failed at a co-location facility, sending temperatures to 115°F/46°C for 10 hours.
  • Power surges and line failures: Losing one side of an A/B feed. Individual power supply failures. There are usually six power supplies in my blade setups, so there's ample warning and redundancy.
  • Individual blade server failures. One server's issues do not affect the others in the enclosure.
  • An in-chassis fire...

I've seen a variety of environments and have had the benefit of installing in ideal data center conditions, as well as some rougher locations. On the HP C7000 and C3000 side, the main thing to consider is that the chassis is entirely modular. The components are designed to minimize the chance of a single component failure affecting the entire unit.

Think of it like this... The main C7000 chassis consists of front, (passive) midplane and backplane assemblies. The structural enclosure simply holds the front and rear components together and supports the systems' weight. Nearly every part can be replaced... believe me, I've disassembled many. The main redundancies are in fans/cooling, power, networking and management. The management processors (HP's Onboard Administrator) can be paired for redundancy; however, the servers can run without them.


Fully-populated enclosure - front view. The six power supplies at the bottom run the full depth of the chassis and connect to a modular power backplane assembly at the rear of the enclosure. Power supply modes are configurable: e.g. 3+3 or N+1. So the enclosure definitely has power redundancy.
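
To make the power-mode point concrete, here's a minimal back-of-the-envelope sketch in Python. The 2400 W per-supply rating and the 6500 W enclosure load are assumptions for illustration only, not figures from my enclosures; roughly, 3+3 (AC-redundant) tolerates losing an entire A or B feed, while N+1 only promises to tolerate a single supply failure.

```python
# Back-of-the-envelope check of blade-enclosure power redundancy modes.
# The wattage and load numbers are illustrative assumptions, not vendor specs.

SUPPLY_WATTS = 2400        # assumed output per hot-swap power supply
TOTAL_SUPPLIES = 6

def usable_capacity(mode: str) -> int:
    """Return the wattage you can safely draw in a given redundancy mode."""
    if mode == "3+3":      # AC-redundant: one whole feed (3 supplies) may drop
        return 3 * SUPPLY_WATTS
    if mode == "n+1":      # power-supply-redundant: any single supply may fail
        return (TOTAL_SUPPLIES - 1) * SUPPLY_WATTS
    raise ValueError(f"unknown mode: {mode}")

enclosure_load = 6500      # assumed steady-state draw of a loaded enclosure, in watts

for mode in ("3+3", "n+1"):
    cap = usable_capacity(mode)
    status = "OK" if enclosure_load <= cap else "OVERCOMMITTED"
    print(f"{mode}: {cap} W usable vs {enclosure_load} W load -> {status}")
```

Same six supplies, different guarantees: the mode you pick decides whether "losing one side of an A/B feed" is a non-event or an outage.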

Fully-populated enclosure - rear view. The Virtual Connect networking modules in the rear have an internal cross-connect, so I can lose one side or the other and still maintain network connectivity to the servers. There are six hot-swappable power supplies and ten hot-swappable fans.

Empty enclosure - front view. Note that there's really nothing to this part of the enclosure. All connections are passed through to the modular midplane.

Midplane assembly removed. Note the six power feeds for the midplane assembly at the bottom.

Midplane assembly. This is where the magic happens. Note the 16 separate downplane connections: one for each of the blade servers. I've had individual server sockets/bays fail without killing the entire enclosure or affecting the other servers.

Power supply backplane(s). Three-phase (3Ø) unit below the standard single-phase module. I changed power distribution at my data center and simply swapped the power supply backplane to deal with the new method of power delivery.

Chassis connector damage. This particular enclosure was dropped during assembly, breaking the pins off of a ribbon connector. This went unnoticed for days, resulting in the running blade chassis catching FIRE...

Here are the charred remains of the midplane ribbon cable. This controlled some of the chassis temperature and environment monitoring. The blade servers within continued to run without incident. The affected parts were replaced at my leisure during scheduled downtime, and all was well.


I've been managing small numbers of blade servers for eight years now, and I've yet to have a system-wide failure that took a number of blades offline. I've come really close due to power-related problems, but haven't yet had a chassis-wide failure that wasn't attributable to outside sources.

Your observation that the chassis does represent a single point of failure is correct, though vendors do build a large amount of redundancy into them these days. All of the blade systems I've used have had parallel power feeds to the blades, multiple network jacks running through separate paths, and, in the case of Fibre Channel, multiple paths from the blade to the back-of-rack optical ports. Even the chassis information system had multiple paths.

With appropriate network engineering (redundant NIC usage, MPIO for storage), single-problem events are entirely survivable, provided you actually verify the redundancy from time to time (see the short check sketched after this list). In my time with these systems I've had the following problems, none of which affected more than one blade, if any:

  • Two power supplies failed in the blade rack. There was enough redundancy in the other four to support the load.
  • Losing a phase on a 3-phase power supply. These supplies are rare these days, but the other two phases had enough capacity to support the load.
  • Losing an inter-chassis management loop. It was like this for years before a vendor tech on another call noticed it.
  • Losing the inter-chassis management loops entirely. We lost management-console access, but the servers kept running as if nothing was wrong.
  • Someone accidentally rebooted the back-of-rack network backplane. Everything in that chassis was using redundant NICs, so there was no interruption of service; all the traffic moved to the other backplane.
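
The management-loop item above is the cautionary tale here: redundancy that nobody verifies tends to quietly stop existing. As a rough illustration (Linux kernel bonding assumed, and bond0 is just a placeholder for whatever your redundant-NIC interface is actually called), something as small as this can flag when one leg of a bonded pair has silently dropped:

```python
#!/usr/bin/env python3
"""Rough sketch: check that a Linux bonded interface still has both legs up.

Assumes the host uses the kernel bonding driver; "bond0" is a placeholder.
"""
import re
import sys

BOND = "bond0"

def count_up_slaves(bond: str) -> int:
    # The bonding driver exposes per-slave status under /proc/net/bonding/<bond>.
    with open(f"/proc/net/bonding/{bond}") as f:
        text = f.read()
    # Each slave section has "Slave Interface: ethX" followed by "MII Status: up/down".
    slaves = re.findall(r"Slave Interface: (\S+)\nMII Status: (\S+)", text)
    up = [name for name, status in slaves if status == "up"]
    print(f"{bond}: {len(up)}/{len(slaves)} slaves up ({', '.join(up) or 'none'})")
    return len(up)

if __name__ == "__main__":
    # Exit non-zero if fewer than two legs are alive, so monitoring can alert.
    sys.exit(0 if count_up_slaves(BOND) >= 2 else 1)
```

Wire something like that (or your vendor's own health checks) into monitoring and a dead path gets noticed in minutes rather than years.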

TomTom's point about cost is very true, though. To get to full cost parity, your blade chassis will have to be fully loaded and likely not using special things like back-of-rack switches. Blade racks make sense in areas where you really need the density because you're space-constrained.


That question could be extended to shared storage. Again I would say that we need two storage units instead of only one - and again the vendors say that these things are so rock solid that no failure is expected.

Actually, no. Your concerns so far made sense; this sentence puts them into "read the stuff in front of your eyes" territory. HA with full replication is a known enterprise feature for storage units. The point is that a SAN (a storage unit) is a lot more complex than a blade chassis, which at the end of the day is just "stupid metal". Everything in a blade chassis except some backplanes is replaceable - all the modules etc. are replaceable - and individual blades ARE allowed to fail. No one says a blade center in itself gives the blades high availability.

This is a lot different from a SAN, which is supposed to be up 100% of the time - in a consistent state - so there you have things like replication etc.

THAT SAID: watch your numbers. I have considered buying blades for some time now and they NEVER MADE FINANCIAL SENSE. The chassis are just too expensive, and the blades are not really cheaper than normal servers. I would suggest looking at the SuperMicro Twin architecture as an alternative.