Does it make sense to install online spare memory?

It's not worth it. With ECC RAM and the HP management agents running, bad memory is easy to detect, and you typically get several warnings and chances to intervene before a problem affects operation. Under standard support, RAM replacement is next-business-day, so there's no need to complicate your RAM arrangement by adding spare DIMMs.

The worst memory issue I've had on an HP ProLiant eventually crashed the server, after several ECC alerts over the course of a week. The errors came, the server rebooted through ASR (Automatic Server Recovery), and the machine came back up with the bad DIMM disabled. This was an HP ProLiant DL580 G4, and the error logs were as follows:

0004 Repaired       22:21  12/01/2008 22:21  12/01/2008 0001
LOG: Corrected Memory Error threshold exceeded (Slot 1, Memory Module 1)

0005 Repaired       20:41  12/06/2008 20:43  12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.

Back in the day, I installed many HP ProLiant DL740 servers that featured a RAID5-style memory array, so a 16GB RAM server actually had 20GB installed in hot-swappable banks of 8 DIMMs. Across the dozens of those servers that I deployed and ran for 5+ years, I only had one DIMM fail. Figures...
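The 16GB-usable versus 20GB-installed figure is what RAID5-style overhead would give if one bank's worth of capacity is spent on parity. A rough sketch of that arithmetic follows; the five-bank layout is an assumption inferred from the 16/20 ratio, not something stated above, and `usable_memory_gb` is a hypothetical helper name.

```python
def usable_memory_gb(installed_gb: float, banks: int) -> float:
    """Usable capacity when one bank's worth of space holds parity.

    Assumes equally sized banks and a RAID5-style layout, so the
    usable share is (banks - 1) / banks of what is installed.
    """
    if banks < 2:
        raise ValueError("a parity layout needs at least two banks")
    return installed_gb * (banks - 1) / banks


# Assumed layout: 20GB installed across five equal banks.
print(usable_memory_gb(20, 5))
```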

Edit:
You're planning to use this in a high-frequency trading environment, and you asked about latency with spare RAM in a server like this. Typically, for low-latency applications, I disable the memory pre-failure checks on my host systems. This is HP's recommendation on page 7 of their "Configuring the HP ProLiant Server BIOS for Low-Latency Applications" white paper. It's a matter of monitoring and risk. I rarely have DIMMs fail. Do you care more about speed or resiliency? You won't get both at the hardware level...


I think this is just wasting money. The memory already has ECC. That being said, if your server will be used 24/7 and can never have downtime, then this might make sense. If you are using this for a hypervisor, it will be simple to move all the VMs off, power down the system, and swap out a bad memory module.

In my experience, high-end server memory chips do go bad every now and then and need replacing.


It's an easy thing for you to decide: work out how much it will cost to enable online-spare or lock-step mode, then compare that with how much memory-related losses of service would cost over the lifetime of the server.
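That comparison can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the dollar figures and the five-year lifetime are made-up placeholders, and the outage rate is simply "two outages in 2.5 years" taken as a yearly average.

```python
def expected_outage_cost(outages_per_year: float,
                         cost_per_outage: float,
                         years: float) -> float:
    """Expected cost of memory-related outages over the server's lifetime."""
    return outages_per_year * cost_per_outage * years


def spare_is_worth_it(spare_cost: float,
                      outages_per_year: float,
                      cost_per_outage: float,
                      years: float) -> bool:
    """True if the outages avoided would cost more than the spare memory.

    Assumes the spare memory prevents every memory-related outage,
    which is optimistic.
    """
    avoided = expected_outage_cost(outages_per_year, cost_per_outage, years)
    return avoided > spare_cost


# Placeholder inputs: substitute your own numbers.
rate = 2 / 2.5           # outages per year (two in 2.5 years)
print(spare_is_worth_it(spare_cost=2_000,
                        outages_per_year=rate,
                        cost_per_outage=10_000,
                        years=5))
```

For a clustered server the cost per outage is close to zero, so the same comparison comes out against the spare.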

We don't use either of these methods on servers that are part of an existing failover cluster (Oracle RAC, vSphere, etc.), but we DO use them where servers cannot be clustered in any practical or economic way.

Only you can decide based on the cost/benefit, but the technology does work. I know for a fact that we've avoided two full system outages on one of our servers over the last 2.5 years, and for us the investment was worth it. Your mileage may vary.