Is there any error logged by CentOS that can conclusively reveal "it is now time to pay for ECC"?

I have a 32GB non-ECC RAM dedicated server with CentOS.

About once a day it crashes at random, without any error in /var/log/kern.log, /var/log/messages, or the MySQL and Apache logs.

CPU, RAM and I/O usage are neither unusually high nor unusually low when it happens.

Is there any error logged by CentOS somewhere that can conclusively reveal "it is now time to pay for ECC"?


Solution 1:

What would you like it to log? CentOS can't know that the contents of non-ECC memory have become corrupt, because it's not knowable; it can only know that the contents of memory make no sense, and panic on the grounds of whatever self-inconsistency it found. That inconsistency might have arisen from RAM corruption, but it might also have arisen from a kernel bug, or some other cause.

The only way to know definitively that memory has become corrupt is to use memory that explicitly includes support for checking for such corruption; to wit, ECC memory.
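To illustrate why detection needs memory that carries extra check information, here is a minimal sketch using a single parity bit. This is a deliberate simplification and only an illustration: real ECC DIMMs typically use a stronger code (commonly 8 check bits per 64 data bits) that can also correct single-bit errors, and all function names below are made up for the example.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the number of set bits in word is odd, else 0."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple:
    """Model a check-capable memory: keep the data word plus its parity bit."""
    return word, parity(word)


def flip_bit(word: int, position: int) -> int:
    """Model a single-bit corruption at the given bit position."""
    return word ^ (1 << position)


def check(word: int, stored_parity: int) -> bool:
    """True if the word still agrees with the parity bit stored alongside it."""
    return parity(word) == stored_parity


original = 0b10110010
data, p = store(original)
corrupted = flip_bit(data, 3)

# Without the parity bit, `corrupted` is just another valid 8-bit value:
# nothing about the value itself says it is wrong.
print(check(data, p))       # True: data matches its check bit
print(check(corrupted, p))  # False: the mismatch exposes the corruption
```

Note that a lone parity bit only catches an odd number of flipped bits and cannot say which bit flipped; it is here only to show that the evidence of corruption lives in the extra bits, which non-ECC memory does not have.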

Edit: that is a completely different question to the one you asked. But my strategy would be: run memtest86+ on the hardware, to see if there are any easy-to-catch repeatable errors, and enable remote syslogging on the server (as when the kernel panics, it often stops writing to the FS but can still squeeze a log message out the NIC), to see what's logged on the next panic.
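As a rough sketch of the receiving end of remote syslogging, the following listens for UDP syslog datagrams on a second machine and appends them to a file. The port number (5514, chosen so it can run unprivileged; the conventional syslog port is 514), the log file name and the function names are all assumptions for illustration. The crashing server's own syslog daemon would still need to be configured to forward to this host, which is not shown here.

```python
import socket


def decode_pri(datagram: bytes):
    """Split a BSD-syslog style datagram '<PRI>text' into (facility, severity, text).

    PRI encodes facility * 8 + severity. If there is no PRI header,
    facility and severity are returned as None.
    """
    text = datagram.decode("utf-8", errors="replace")
    if text.startswith("<") and ">" in text[:6]:
        pri, _, rest = text[1:].partition(">")
        if pri.isdigit():
            return int(pri) // 8, int(pri) % 8, rest
    return None, None, text


def listen(host="0.0.0.0", port=5514, logfile="remote-kernel.log"):
    """Receive syslog datagrams forever and append them to logfile."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    # Line-buffered, so each message reaches the file as soon as it arrives.
    with open(logfile, "a", buffering=1) as out:
        while True:
            data, (sender, _) = sock.recvfrom(8192)
            facility, severity, text = decode_pri(data)
            out.write(f"{sender} facility={facility} severity={severity} {text.rstrip()}\n")
```

Calling `listen()` on the receiving machine starts the loop. In practice a stock syslog daemon on the second machine does the same job; the point is only that the last messages before a panic end up on a box that did not crash.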

Solution 2:

ECC memory has two advantages:

  • It is usually registered (at least in server modules), meaning that a register sits between the memory controller and the memory chips on the module. This is supposed to remove electrical load from the memory controller. This is true of all RDIMMs, not just ECC RAM, and unbuffered ECC modules also exist.
  • It can detect errors and, if it cannot recover from them, at least report that they happened.

Given this, it is actually very difficult to determine whether you would have benefited from ECC RAM without having ECC RAM. By definition you cannot log the failure to detect an error, and you certainly don't have data on whether an error that may or may not have happened was the result of the memory controller messing up.

That said, if you run memtest86, you will determine a couple of things. If you find no errors, either you need ECC RAM, or the problem is with something else (so if you rule out absolutely every other piece of hardware and software as the cause, you have shown the need for ECC RAM). If you find consistent errors, chances are the RAM is bad and just needs to be replaced. If you find inconsistent errors, the CPU might be bad, or you might need ECC RAM. If you find that memtest86 itself crashes, either the lowest-order DIMM is bad, or the CPU is bad, or you need ECC RAM.
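The cases above can be written down as a small lookup table, which makes it plain that only one outcome points at a single cause. This is just a restatement of the paragraph above, not a diagnostic tool; the outcome keys and the function name are invented for the example.

```python
# Possible explanations for each memtest outcome, as described above.
MEMTEST_OUTCOMES = {
    "no_errors": ["need ECC RAM", "problem is with something else"],
    "consistent_errors": ["RAM is bad, replace it"],
    "inconsistent_errors": ["CPU is bad", "need ECC RAM"],
    "memtest_crashes": ["lowest-order DIMM is bad", "CPU is bad", "need ECC RAM"],
}


def interpret(outcome: str) -> list:
    """Return the candidate explanations for a given memtest outcome."""
    return MEMTEST_OUTCOMES[outcome]


def is_conclusive(outcome: str) -> bool:
    """An outcome is conclusive only if it leaves exactly one explanation."""
    return len(interpret(outcome)) == 1


for name in MEMTEST_OUTCOMES:
    print(name, "->", " or ".join(interpret(name)))
```

Only `consistent_errors` is conclusive here, and it points at replacing the RAM, not at buying ECC.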

Regardless, this is very tricky to show definitively. ECC RAM is most useful in applications where invisible errors in calculations are likely to cause extreme problems, or in applications where the sheer quantity of RAM combined with other conditions makes errors statistically likely. However, these criteria are themselves fuzzy and subjective, so it follows that there isn't really an objective criterion for this.