Linux with 256GB of RAM / 48 cores - machine starts thrashing/choking with tons of memory left

Machine: Dell R815, CentOS 5.4, 256GB of RAM, 4 x 12-core CPUs.

We have an application that has a 275GB file. It does an in-place sort on 20GB of data at a time, i.e., it swaps bits around and replaces them in the same file. This all works fine.

A final pass then reads through the entire file, does a merge sort on the different 20GB chunks, and outputs the result to a whole new file.
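
Our data is binary, but conceptually that final pass is the classic k-way merge of already-sorted chunks - roughly what GNU sort's merge mode does for text (an illustrative sketch only; the file names are made up):

    # Each chunk_*.sorted is already sorted; sort -m (merge mode)
    # streams through all of them once and writes a single sorted
    # output, without re-sorting anything.
    sort -m chunk_*.sorted > merged_output.dat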

This process SEEMS to run okay for a while, and it ends up flushing around 50GB out to disk. Sometime after that, the WHOLE machine starts freaking out.

Simple commands like ps -ef and ls -al hang for a long time and show up as taking 100% CPU (which is just one core).
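
When they hang like this, the commands are presumably stuck in uninterruptible sleep (D state) waiting on I/O. This is standard ps/sysrq usage, nothing specific to our setup, if anyone wants to check the same thing:

    # List processes stuck in uninterruptible sleep (D state),
    # which usually means blocked on disk I/O or writeback.
    ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'

    # Kernel-side view of where each blocked task is waiting
    # (needs root; output appears in dmesg).
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 50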

Looking at the memory stats in top, I see that it is using around 120GB of RAM (so 128GB free) and has 120GB under the "cached" section.
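
The "cached" number on its own doesn't show how much of that is dirty data still waiting to be written back, which is my main suspect; /proc/meminfo does:

    # How much of the page cache is dirty (modified, not yet on
    # disk) vs. actively under writeback right now.
    grep -E '^(Dirty|Writeback):' /proc/meminfo

    # Watch memory, swap, and I/O activity while the merge runs.
    vmstat 5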

Has anyone seen this kind of behavior before? The same process runs fine on a machine with 64GB of memory - so somehow I think it is related to the amount of RAM I have in the machine.

(As we speak, I am rerunning the test on this machine with all but 64GB of the RAM removed, to rule out a hardware issue.)

Am I perhaps missing some vm params in /etc/sysctl.conf?
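
Something along these lines, maybe? (A sketch of the vm knobs I suspect; the values below are guesses on my part, not tested recommendations.)

    # /etc/sysctl.conf - candidate vm tunables (values are guesses)

    # Start background writeback earlier and throttle writers
    # sooner, so tens of GB of dirty pages can't pile up in cache.
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10

    # On NUMA boxes, don't force reclaim on the local node while
    # other nodes still have free memory.
    vm.zone_reclaim_mode = 0

Applied with sysctl -p after editing.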

Thanks!


Solution 1:

Your question reminded me of something I read recently:

http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

This addresses how NUMA architectures (like you might find in, say, a 48-core AMD system) affect memory allocation and swapping. I don't know if this is what you're running into, but it sounded sufficiently similar that it may be worth a read.
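
If you want to check whether NUMA is a factor on your box, numactl and numastat (from the numactl package on CentOS) show the per-node picture; ./your_sort_process below is a placeholder for your own binary:

    # Show the NUMA topology: how the 256GB is split across the
    # four sockets, and how much is free on each node.
    numactl --hardware

    # Per-node allocation stats; large numa_miss / numa_foreign
    # counts suggest cross-node allocation pressure.
    numastat

    # If local-node pressure is the problem, interleaving the
    # process's allocations across all nodes may help.
    numactl --interleave=all ./your_sort_process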

Even if it's not the answer, it makes for fascinating reading.

Solution 2:

So this appeared to be a kernel bug in 64-bit CentOS 5.4 AND 64-bit Fedora 14. After I installed CentOS 5.5, the problem went away.

Sorry I don't have a better answer for everyone...