Linux server ran out of memory, stopped everything except SSH, and swap was unused

Solution 1:

From the log output, Apache is trying to make a very large allocation. To figure out why, you'll probably need to look at your Apache setup and see what could be doing that (are you using mod_perl, mod_python, etc.?). If you can't find it that way, you can put a proxy like nginx in front of Apache; nginx will then log which request failed. If you want, you can run nginx on the same host and use ulimit (or an /etc/security/limits.conf entry) to restrict Apache so it gets killed before the OOM killer is invoked (letting nginx, urlsnarf, etc. keep running).
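A minimal sketch of that last idea, assuming a Debian-style Apache that sources /etc/apache2/envvars at startup (the file location and a sensible limit value will differ on your system):

# /etc/apache2/envvars -- cap each Apache process's address space so a
# runaway allocation fails with ENOMEM instead of exhausting the box.
# 1 GB expressed in kB; tune to your normal worker size plus headroom.
ulimit -v 1048576

With a limit like this in place, a single bad request kills one Apache child instead of dragging the whole machine (and nginx with it) into an out-of-memory situation.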

The flush PID is the kernel thread that syncs dirty VFS cache buffers back to disk, so it is most likely related to the rm. The numbers in its name are the major:minor device number it is flushing for, and you can verify this by looking up the flush thread and checking that it's working on the same device the rm was working on:

# ps axwwu | grep flush
root     21658  0.0  0.0      0     0 ?        S    21:33   0:00 [flush-8:16]
# ls -l /dev/ | grep 8 | grep 16
brw-rw----  1 root disk      8,  16 2011-09-13 13:53 sdb
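If your system has lsblk (part of util-linux on most current distros), the major:minor pair can be mapped to a device name more directly; a quick sketch:

# lsblk -o NAME,MAJ:MIN | grep '8:16'     # should print the sdb line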

Solution 2:

Your system has run out of low memory (RAM which is directly mapped into the kernel virtual memory space). In a 32-bit Linux system the kernel can use at most about 700-800 MB of physical RAM for its internal data (and adding more RAM actually makes the situation worse, because more low memory is needed for the data structures used by the memory manager). The rest of RAM goes into the “high memory” zone, which can be used by userspace processes and the page cache, but is not usable for the kernel itself.

Excerpt from your logs:

DMA free:2908kB ... present:15808kB ...
Normal free:5376kB ... present:719320kB ... slab_reclaimable:664084kB slab_unreclaimable:11976kB ...
HighMem free:149216kB ... present:1336932kB ...

The DMA zone is the low 16 MB needed to work with obsolete ISA devices; these days it can mostly be ignored. Low memory is the DMA and Normal zones together; only this memory can be used to hold kernel data. The rest of memory goes to the HighMem zone.
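You can watch the low/high split on a running 32-bit box before it gets into trouble; a quick sketch (the Low*/High* fields only exist when the kernel is built with CONFIG_HIGHMEM, and free -l needs a procps version that supports the flag):

# Per-zone totals straight from the kernel:
grep -E 'LowTotal|LowFree|HighTotal|HighFree' /proc/meminfo
# Or the same information as a summary table, in MB:
free -lm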

Note that there was plenty of free memory in the HighMem zone, but the system still ran out of memory, because the memory was needed for kernel data structures, and HighMem is not suitable for this. Most of the Normal zone was occupied by the slab cache (slab_reclaimable and slab_unreclaimable), a separate kernel memory allocator used for objects smaller than a page. Unfortunately, it is impossible to tell from your logs which slab caches took so much memory; current slab cache usage statistics can be read from /proc/slabinfo, so you may want to set up some monitoring for this data.

The real problem might be a memory leak bug in the kernel; monitoring /proc/slabinfo should help identify the culprit if the problem happens again.
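A minimal monitoring sketch, assuming slabtop (from procps) is installed and the log path /var/log/slabinfo.log is acceptable; run it as root, since /proc/slabinfo is not world-readable:

# Log the largest slab caches every 10 minutes; a kernel leak shows up
# as one cache growing without bound between snapshots.
while true; do
    date >> /var/log/slabinfo.log
    slabtop --once --sort=c | head -n 20 >> /var/log/slabinfo.log
    sleep 600
done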