What happened to ECC RAM?

Solution 1:

A decade or two ago I could buy ECC (Error Correction Code) RAM for PCs I assembled. ECC RAM provided SEC-DED (single error correction, double error detection), I guess to guard against bit flips caused by ionizing radiation (I don't know what else could cause transient bit errors to pop up in RAM or I/O buses).

There are 3 general causes of bit errors, the first two of which are single event upsets:

  1. Radiation (primarily free neutrons). This particular phenomenon is dependent on a number of things such as the neutron cross section of the particular device. It may seem counter-intuitive, but the newer much smaller geometries have a lower probability of an upset due to neutrons because they have been designed to be less susceptible. See the Xilinx link (from below).

  2. Lead, specifically Pb-210, which is part of the uranium decay chain and is found in older kit in the balls of BGA devices. Xilinx refers to errors from this as the alpha rate, since the decay emits alpha particles. Clearly not an issue for the great deal of current equipment that is lead-free (but still quite an issue in aerospace, where tin-lead processing is still common).

  3. General bit error rate issues. A memory interface is a communication channel, and all communications channels have an error rate. Admittedly, you may never see a single bit error in the life of a particular piece of equipment as this is a statistical quantity. Errors due to electrical noise and poor device decoupling also fall into this category.
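For reference, SEC-DED means that whichever of these causes flips a bit, one flipped bit per word gets repaired and two flipped bits at least get flagged. A minimal sketch in Python, assuming a toy extended Hamming(8,4) code on 4 data bits rather than the wider (72,64) style of code commonly used on ECC DIMMs; the names `encode` and `decode` are made up for illustration:

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the flipped position after one flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: treat as a single error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))            # untouched word

    one_flip = list(word)
    one_flip[6] ^= 1               # simulate a single event upset
    print(decode(one_flip))

    two_flips = list(one_flip)
    two_flips[2] ^= 1              # a second flip in the same word
    print(decode(two_flips))
```

The third case is the important one: the decoder cannot repair it, but it does not silently hand back bad data either, which is the whole point of the "DED" half.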

i.e., if ECC RAM was considered a useful feature a decade ago, do the reasons it was useful no longer apply to current personal computers and servers? Or is the thinking now that ECC RAM was never actually useful?

It was useful, but of limited value, although many side channel attacks can be mitigated by its use.

The real reason you can't find it on most commercially available boards is simply cost. Those boards that do have it carry a rather large premium, far higher than the delta cost of the silicon to handle it and the extra 8 bits (for a 64-bit memory system). The cost-benefit analysis doesn't support its broad availability.
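The "extra 8 bits" figure is not arbitrary. Assuming a Hamming-style SEC-DED code, you need the smallest r with 2^r >= m + r + 1 check bits to correct one error in m data bits, plus one overall parity bit to detect a second. A quick sketch of that arithmetic (the function name `secded_check_bits` is mine, not from any library):

```python
def secded_check_bits(data_bits):
    """Check bits needed by a Hamming-style SEC-DED code for `data_bits` of data."""
    r = 0
    # Single error correction: r check bits must be able to name every
    # position in the (data_bits + r)-bit word, plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit upgrades SEC to SEC-DED.
    return r + 1


if __name__ == "__main__":
    for width in (8, 16, 32, 64, 128):
        extra = secded_check_bits(width)
        print(f"{width:3d} data bits -> {extra} check bits "
              f"({100 * extra / width:.1f}% overhead)")
```

For 64 data bits this gives 8 check bits, i.e. a 72-bit wide word and 12.5% more DRAM, which is why the price premium on ECC boards and modules is hard to justify by silicon cost alone.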

I do remember a research paper from Boeing that discussed soft errors in a Denver data centre. The amount of free neutrons is (up to a certain level) proportional to altitude. The higher you go, the more there are.

If ECC memory was helpful twenty years ago presumably it would be more helpful now that PCs are running with 1-2 orders of magnitude more memory, at lower voltages and with smaller physical features that (presumably) are more susceptible to corruption from stray radiation. Are any of these assumptions incorrect?

The memory interfaces we have today are far more robust than you might think; for DDRx, the data strobes are differential (so they reject common mode noise) and lower transition voltages are actually better for high speed interfaces, as we proved years ago with ECL.

In avionics, and in particular flight safety critical avionics such as flight control computers, the use of ECC for L2 and beyond is mandatory as is the use of parity for L1. That is one of the reasons those cards are not from Intel or AMD.

[Update]. The specifics of just how memory cells are laid out have a rather large effect on their susceptibility to SEUs; Xilinx has taken a particular approach that effectively stacks memory cells in such a way that the probability of a high-energy neutron causing a bit flip is significantly reduced.

As I am not an IC designer that is all I can really say. There is a great deal more information at the Rosetta Project.

Solution 2:

15+ years ago Intel decided ECC RAM support was not of value in consumer machines.

As a result, the market doesn’t support it outside of server hardware, and end consumers are paying the price.

This January 2021 article in ExtremeTech provides a fairly solid summary of what happened: “Linus Torvalds Blames Intel for Killing ECC RAM in Consumer Systems”:

“There was a time when you could buy ECC support on mainstream chipsets, but Intel phased out that capability on non-Xeon platforms a number of years ago. The 975X may have been the last consumer Intel platform to support it, and that family launched 15 years ago. The Xeon 3450 chipset was cross-compatible with certain high-end CPUs in the Nehalem family, but that’s still a Xeon chipset — not a mainstream part.”

“As a result, support for ECC in consumer products — and the availability of ECC RAM for consumer products — both fell off a cliff.”

Since the article quotes Linus Torvalds, here is his specific complaint:

“The memory manufacturers claim it’s because of economics and lower power. And they are lying bastards – let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an ‘attack’, when it always was ‘we’re cutting corners.’”

The issue here is that Linux is getting blamed for kernel errors, but Linus Torvalds believes the root cause is hardware issues that can be traced to the prevalence of non-ECC RAM in machines nowadays.

But that is a tangent… What it comes down to is PC manufacturers cutting corners. Classic manufacturing issue.

And nowadays where PC hardware is considered pretty disposable, there might be some rationale here: RAM starts to get flaky, just toss the machine and buy a new one. The truth is the market is filled with non-techs and non-PC builders so hey… It stinks but it is what it is.

Solution 3:

I agree with the answer provided by @Giacomo1968 as far as history goes. The current state however is changing. AMD has recently started to support ECC memory in their current desktop CPU line for the AM4 socket: "ECC is not disabled. It works, but not validated for our consumer client platform." (Source: Reddit)

That said, the motherboard also needs to support this. Some consumer boards do, some don't.
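Since "works, but not validated" leaves you to confirm it yourself, one way to check on Linux is to read the kernel's EDAC error counters from sysfs. A minimal sketch, assuming a Linux system with an EDAC driver loaded for the memory controller; the helper name `read_edac_counts` is hypothetical, and an empty result only means no EDAC memory controller is registered, not necessarily that ECC is off:

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {'ce_count': int, 'ue_count': int}} from EDAC sysfs.

    ce_count is corrected errors, ue_count is uncorrected errors.
    Controllers or files that are missing are simply skipped.
    """
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (driver not loaded, or no ECC reporting).")
    for mc, entry in found.items():
        print(mc, entry)
```

A controller showing up here with a `ce_count` that occasionally ticks upward is about the best evidence you can get, short of fault injection, that corrections are really happening on a given CPU and board combination.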