RAM tests inconsistently - what is the most likely culprit? (i.e. what should I spend money on replacing)

Solution 1:

It doesn't sound like any component is defective; rather, you are using an incompatible combination.

Populating multiple sockets on the same memory bus increases the capacitance on each data line and slows its rise time, which can cause transitions to arrive late and be misdetected. Electrical engineers describe this loading limit in terms of "fan-out": the number of loads a single driver can reliably drive.

This is further complicated by the fan-out internal to each memory module. The number and topology of the DRAM devices on the module (its "rank" organization) affects how many modules you can successfully connect in parallel.

Server motherboards with a lot of memory sockets actually require buffered memory, which uses a cascading network of buffers to limit the fan-out (and therefore the capacitance) seen by each driver. The buffers add some delay of their own, but that delay grows only logarithmically with the number of loads, whereas the capacitance on an unbuffered bus grows linearly.
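To make the linear-versus-logarithmic contrast concrete, here is a rough toy model. Every number in it (capacitance per load, trace capacitance, driver resistance, maximum fan-out per buffer) is a made-up placeholder, and a real DDR bus behaves like a transmission line rather than a single lumped RC node, so treat this only as an illustration of the scaling, not as a way to predict whether a given configuration will work.

```python
def unbuffered_load_pf(n_loads, c_per_load_pf=2.0, c_trace_pf=5.0):
    """Total capacitance (pF) on one data line when every load shares the wire.

    The default values are hypothetical placeholders, not datasheet figures.
    """
    return c_trace_pf + n_loads * c_per_load_pf


def rise_time_ns(r_driver_ohm, c_load_pf):
    """10%-90% rise time of a first-order RC node, roughly 2.2 * R * C."""
    # ohm * pF gives picoseconds; multiply by 1e-3 to get nanoseconds.
    return 2.2 * r_driver_ohm * c_load_pf * 1e-3


def buffer_tree_stages(n_loads, max_fanout=4):
    """Number of buffer stages needed so no buffer drives more than max_fanout loads."""
    stages = 0
    reach = 1
    while reach < n_loads:
        reach *= max_fanout
        stages += 1
    return stages


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        c = unbuffered_load_pf(n)
        print(n, "loads:",
              "unbuffered C =", c, "pF,",
              "rise time =", round(rise_time_ns(40.0, c), 2), "ns,",
              "buffer stages =", buffer_tree_stages(n))
```

Doubling the number of loads roughly doubles the unbuffered capacitance (and with it the rise time), while the buffered tree only gains a stage each time the load count grows by a factor of `max_fanout`.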

Wikipedia discusses this: https://en.wikipedia.org/wiki/Memory_rank

Some motherboard manuals actually call this sort of thing out. For others, you can deduce the information from the RAM compatibility lists. As an example, the manual for the ASUS Z170-A motherboard shows that dual-rank DIMMs (called DS = double sided in the manual) can only be used in two slots at once on that board, whereas four single-rank DIMMs can be used at once.
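If you want to sanity-check a planned configuration against a rule like that, a minimal sketch could look like the following. The limits table simply encodes the Z170-A example above (four single-rank or two dual-rank DIMMs); the function name and table are hypothetical, the check says nothing about mixing ranks, and you would need to fill in the limits from your own board's manual or compatibility list.

```python
from collections import Counter

# Hypothetical limits modeled on the Z170-A example: ranks per DIMM -> max DIMMs.
# Replace these with the figures from your own motherboard's documentation.
MAX_DIMMS_BY_RANK = {1: 4, 2: 2}


def population_problems(dimm_ranks, limits=MAX_DIMMS_BY_RANK):
    """Return a list of rule violations; an empty list means no known rule was broken.

    dimm_ranks is a list with one entry per installed DIMM, giving its rank count.
    """
    problems = []
    for rank, count in sorted(Counter(dimm_ranks).items()):
        limit = limits.get(rank)
        if limit is None:
            problems.append(f"no rule known for {rank}-rank DIMMs")
        elif count > limit:
            problems.append(f"{count} x {rank}-rank DIMMs installed, limit is {limit}")
    return problems


if __name__ == "__main__":
    print(population_problems([1, 1, 1, 1]))  # four single-rank DIMMs
    print(population_problems([2, 2, 2, 2]))  # four dual-rank DIMMs
```

An empty result only means the configuration does not break the rules you entered; it is not a guarantee that the memory will test clean.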


Solution 2:

That sounds like an issue in the processor's integrated memory controller.

In modern systems, the motherboard doesn't really play a role in memory management beyond providing a path between the memory modules and the processor. Memory is connected directly to the processor to minimize latency; the "northbridge" that connected the memory to the processor in older systems is now part of the processor itself. (The firmware or PCH may control how the processor runs the RAM, but it doesn't make sense for either to cause bit errors of the sort you describe, since that is ultimately the processor's responsibility.) Hence, the very first thing I'd suspect in a situation like this is a faulty IMC.

In fact, I'd be very surprised if the motherboard or system firmware were to blame for the problems you're experiencing.

Solution 3:

I see some bad reviews for the BIOS on that motherboard. I would start by checking for a BIOS update. Never skimp on the motherboard.

Solution 4:

It's possible that the RAM could be faulty as well, even though it may not appear to be. I had a recent issue with my home server involving a fatal mishap with some iced tea...

I went through the entire process of replacing each part individually (2 CPUs, motherboard, power supply, and 2 banks of 16 GB (2x8 GB) RAM), and everything tested fine when I used just a single bank of RAM with a single CPU (except for 1 CPU, which was toast).

It didn't matter which configuration I used: it always worked when I had a single CPU and a single bank of RAM (whether it was 16 GB or 32 GB of RAM), but when I put in the 2nd CPU and split the RAM so there was 16 GB per bank, the server failed to boot.

It wasn't until I replaced one bank of RAM completely that it finally booted and ran properly, and it has run fine ever since.

tl;dr: As @moab stated in his comment, you can never tell for certain until you test every component in a compatible system.