Can Linux scrub memory?

Does Linux have a mechanism to "scrub" memory, e.g. by testing the memory and marking areas that fail as unusable, so that the system can continue to operate "safely" even with bad RAM chips installed?


Solution 1:

The answer is yes, and it is done transparently (provided you have ECC memory to detect errors, and your kernel version is at least 2.6.30 to continue to operate safely).

Basically, your memory is checked on every read from the processor, and scrubbed periodically*, for consistency with the Error Correcting Codes (ECC). If an error is found, you get a Machine Check Exception, which is then logged and picked up by mcelog (http://www.mcelog.org/).

If your error was correctable, it increments a "leaky bucket" counter. When a physical memory page fails too often, it is transparently replaced by another one: the contents of the page are copied to a new location, your virtual memory address is updated to point to the new page, and the old page is marked by the OS as not to be used anymore.
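
To make the "leaky bucket" idea concrete, here is a minimal sketch in Python. It is not mcelog's actual implementation: the class name, the continuous drain rate and the example thresholds are all assumptions chosen for illustration. Each page gets a bucket that fills by one per corrected error and drains over time, so only pages whose errors arrive faster than the drain rate get flagged.

```python
class LeakyBucket:
    """Hypothetical per-page error counter that drains over time."""

    def __init__(self, capacity, period):
        self.capacity = capacity  # errors tolerated within one period
        self.period = period      # seconds over which `capacity` errors drain
        self.level = 0.0
        self.last = None          # timestamp of the previous error

    def add(self, now):
        """Record one corrected error at time `now`; True if over threshold."""
        if self.last is not None:
            leaked = (now - self.last) * self.capacity / self.period
            self.level = max(0.0, self.level - leaked)
        self.last = now
        self.level += 1
        return self.level > self.capacity


def pages_to_offline(events, capacity, period):
    """events: (timestamp, page_frame_number) pairs, ordered by time."""
    buckets = {}
    flagged = []
    for ts, pfn in events:
        bucket = buckets.setdefault(pfn, LeakyBucket(capacity, period))
        if bucket.add(ts) and pfn not in flagged:
            flagged.append(pfn)
    return flagged
```

With a threshold of 3 errors per 100 seconds, a page that reports four errors within a few seconds is flagged, while a page that reports one error every 100 seconds never is.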

This is called "soft-offlining" on Linux (and memory page retirement on Solaris, I don't know about other OSs).

If your error was not correctable, however, what is called "hard-offlining" happens: your memory page gets removed from the normal operating system memory management, and your application gets killed. (NB: this is done with a catchable SIGBUS signal that tells you where the error happened, but it is rare enough that trying to catch it is usually not worth the trouble.) If your memory page is mapped from a file and clean, the OS can also reload it transparently at another physical location instead of killing the process.
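
Both kinds of offlining can also be requested by hand through sysfs, which is mostly useful for testing. The sketch below assumes a kernel built with memory-failure handling, which exposes /sys/devices/system/memory/soft_offline_page and hard_offline_page, and it needs root to write to the real files. The function name is hypothetical. Be careful with hard_offline_page: it treats the page as corrupted, so a process using it may be killed.

```python
from pathlib import Path

SOFT_OFFLINE = Path("/sys/devices/system/memory/soft_offline_page")
HARD_OFFLINE = Path("/sys/devices/system/memory/hard_offline_page")


def request_page_offline(phys_addr, control_file=SOFT_OFFLINE):
    """Ask the kernel to offline the page containing `phys_addr`.

    The kernel expects the physical address as a hex number written
    to the control file. Returns the text that was written.
    """
    if phys_addr < 0:
        raise ValueError("physical address must be non-negative")
    text = f"{phys_addr:#x}"
    Path(control_file).write_text(text)
    return text
```

For example, request_page_offline(0x18690000) asks for a soft-offline of the page at that address. Passing any other path as control_file lets you dry-run the function against an ordinary file.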

You can read more on the mcelog site: there are plenty of configuration options, you can have other behaviours triggered, and there are further leads on what to read and how to make sure mcelog is running on your system.


* Scrubbing, or "Patrol Scrubbing", consists of reading memory, checking it against ECC for errors, and overwriting it with the corrected memory words when an error is discovered. The term patrol scrubbing is used in contrast to correcting bad data only when a normal memory read runs into it, which is sometimes called "Demand Scrubbing". Scrubbing is a hardware procedure that can be enabled, usually through the BIOS.
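
If an EDAC driver is loaded for your memory controller (an assumption, as not every platform has one), the kernel also exposes error counters and, on some controllers, the scrub rate under /sys/devices/system/edac/mc. A rough sketch for reading them follows. The function names are made up for this example, and any file that is missing or unreadable is reported as None.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def edac_summary(root=EDAC_ROOT):
    """Collect corrected/uncorrected counts and scrub rate per controller."""
    summary = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        summary[mc.name] = {
            "ce_count": read_int(mc / "ce_count"),          # corrected errors
            "ue_count": read_int(mc / "ue_count"),          # uncorrected errors
            "sdram_scrub_rate": read_int(mc / "sdram_scrub_rate"),
        }
    return summary
```

An empty result from edac_summary() means no memory controller is registered with EDAC, which says nothing about whether the hardware scrubs on its own.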

Solution 2:

This is actually a bad idea. Memory cannot be reliably tested in a quick sweep. This is why software like memtest86 uses multiple passes with different bit patterns to test memory. Solution:

  1. Test memory with memtest86, preferably using the long test. Leave it running overnight, as it will take a long time.

  2. If bad memory is detected, use the memmap kernel parameter to force the kernel not to use that memory:

   memmap=nn[KMG]$ss[KMG]
            [KNL,ACPI] Mark specific memory as reserved.
            Region of memory to be reserved, from ss to ss+nn.
            Example: Exclude memory from 0x18690000-0x1869ffff
                     memmap=64K$0x18690000
                     or
                     memmap=0x10000$0x18690000
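
Since memtest86 reports individual failing addresses, it can help to compute the parameter rather than do the hex arithmetic by hand. The helper below is a hypothetical sketch: it widens a bad address range to whole pages and formats it in the nn$ss form shown above. The 4 KiB page size is an assumption, so adjust it if your system differs.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def memmap_reserve(first_bad, last_bad, page_size=PAGE_SIZE):
    """Build a memmap=nn$ss parameter covering first_bad..last_bad inclusive."""
    if first_bad < 0 or last_bad < first_bad:
        raise ValueError("need 0 <= first_bad <= last_bad")
    start = first_bad - (first_bad % page_size)       # round down to a page
    end = (last_bad // page_size + 1) * page_size     # round up past last byte
    return f"memmap={end - start:#x}${start:#x}"


print(memmap_reserve(0x18690000, 0x1869FFFF))  # prints memmap=0x10000$0x18690000
```

Depending on the bootloader, the $ may need escaping in its configuration file, so check the resulting command line in /proc/cmdline after rebooting.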

In addition, you can use ECC memory, which will automatically correct 1-bit errors and detect 2-bit errors in your memory (and you'll get log messages from the kernel about uncorrectable memory problems if they happen).
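
The "correct 1-bit, detect 2-bit" behaviour can be illustrated with a toy code. The sketch below uses an extended Hamming code over 4 data bits (8 bits stored per nibble). Real ECC DIMMs typically protect 64 data bits with 8 check bits, and the exact code is up to the memory controller, so treat this only as an illustration of the principle. The function names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is the overall parity bit,
    indexes 1..7 are the classic Hamming(7,4) positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for valid words
    parity_bad = sum(code) % 2 == 1
    if parity_bad:
        code[syndrome] ^= 1          # one flipped bit; syndrome 0 = parity bit
        status = "corrected"
    elif syndrome != 0:
        return None, "uncorrectable" # parity fine but syndrome set: two flips
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one of the 8 stored bits still decodes to the original nibble with status "corrected", while flipping any two yields "uncorrectable". This mirrors the difference between a logged corrected error and an uncorrectable one.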