Rsync triggered Linux OOM killer on a single 50 GB file

So let us read the oom-killer output and see what can be learned from there.

When analyzing OOM killer logs, it is important to look at what triggered it. The first line of your log gives us some clues:

[kernel] [1772321.850644] clamd invoked oom-killer: gfp_mask=0x84d0, order=0

order=0 is telling us how much memory is being requested. The kernel's memory management can only hand out blocks in powers of two pages, so clamd has requested 2^0 pages of memory, i.e. a single 4KB page.
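As a quick illustration of what order means (a minimal sketch; the 4KB page size is the usual x86 value):

# order=N means the buddy allocator hands out a block of 2**N contiguous pages.
PAGE_SIZE_KB = 4  # typical x86 page size

for order in range(4):
    pages = 2 ** order
    print(f"order={order}: {pages} page(s) = {pages * PAGE_SIZE_KB} KB")
# order=0: 1 page(s) = 4 KB  <- what clamd asked for here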

The lowest two bits of the GFP_MASK (get free pages mask) constitute the so-called zone mask, telling the allocator which zone to get the memory from:

Flag            value      Description
                0x00u      0 implicitly means allocate from ZONE_NORMAL
__GFP_DMA       0x01u      Allocate from ZONE_DMA if possible
__GFP_HIGHMEM   0x02u      Allocate from ZONE_HIGHMEM if possible

Memory zones are a concept created mainly for compatibility reasons. In a simplified view, there are three zones for an x86 Kernel:

Memory range   Zone       Purpose 

0-16 MB        DMA        Hardware compatibility (devices)
16 - 896 MB    NORMAL     space directly addressable by the Kernel, userland 
> 896 MB       HIGHMEM    userland, space addressable by the Kernel via kmap() calls

In your case, the zonemask is 0, meaning clamd is requesting memory from ZONE_NORMAL.

The other flags resolve to

/*
 * Action modifiers - doesn't change the zoning
 *
 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.
 *
 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.
 *
 * __GFP_NORETRY: The VM implementation must not retry indefinitely.
 */
#define __GFP_WAIT      0x10u   /* Can wait and reschedule? */
#define __GFP_HIGH      0x20u   /* Should access emergency pools? */
#define __GFP_IO        0x40u   /* Can start physical IO? */
#define __GFP_FS        0x80u   /* Can call down to low-level FS? */
#define __GFP_COLD      0x100u  /* Cache-cold page required */
#define __GFP_NOWARN    0x200u  /* Suppress page allocation failure warning */
#define __GFP_REPEAT    0x400u  /* Retry the allocation.  Might fail */
#define __GFP_NOFAIL    0x800u  /* Retry for ever.  Cannot fail */
#define __GFP_NORETRY   0x1000u /* Do not retry.  Might fail */
#define __GFP_NO_GROW   0x2000u /* Slab internal usage */
#define __GFP_COMP      0x4000u /* Add compound page metadata */
#define __GFP_ZERO      0x8000u /* Return zeroed page on success */
#define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
#define __GFP_NORECLAIM  0x20000u /* No realy zone reclaim during allocation */

according to the Linux MM documentation, so your request carries the flags for GFP_ZERO, GFP_REPEAT, GFP_FS, GFP_IO and GFP_WAIT, and is thus not particularly picky.
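If you want to verify that decoding yourself, here is a minimal sketch (plain Python, no kernel API involved) that applies the flag values quoted above to the mask from your log line:

# Decode gfp_mask=0x84d0 using the 2.6-era flag values listed above.
FLAGS = {
    0x01: "__GFP_DMA", 0x02: "__GFP_HIGHMEM",   # zone mask = lowest two bits
    0x10: "__GFP_WAIT", 0x20: "__GFP_HIGH", 0x40: "__GFP_IO", 0x80: "__GFP_FS",
    0x100: "__GFP_COLD", 0x200: "__GFP_NOWARN", 0x400: "__GFP_REPEAT",
    0x800: "__GFP_NOFAIL", 0x1000: "__GFP_NORETRY", 0x2000: "__GFP_NO_GROW",
    0x4000: "__GFP_COMP", 0x8000: "__GFP_ZERO",
    0x10000: "__GFP_NOMEMALLOC", 0x20000: "__GFP_NORECLAIM",
}

mask = 0x84d0
print("zone mask:", mask & 0x3, "(0 means ZONE_NORMAL)")
print("flags set:", [name for bit, name in FLAGS.items() if mask & bit])
# -> ['__GFP_WAIT', '__GFP_IO', '__GFP_FS', '__GFP_REPEAT', '__GFP_ZERO']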

So what's up with ZONE_NORMAL? Some generic stats can be found further on in the OOM output:

[kernel] [1772321.850770] Normal free:8056kB min:8048kB low:10060kB high:12072kB active_anon:0kB inactive_anon:0kB active_file:248kB inactive_file:388kB unevictable:0kB isolated(anon) :0kB isolated(file):0kB present:890008kB

Noticeable here is that free is just 8kB above min and well under low. This means your host's memory manager is somewhat in distress and kswapd should already be swapping out pages, since the zone is in the yellow phase of the memory-manager watermark graph: [image: Linux memory manager graph]

Some more information on the memory fragmentation of the zone is given here:

[kernel] [1772321.850795] Normal: 830*4kB 80*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8056kB

basically stating that you have a single contiguous block of 4MB, with the rest heavily fragmented into mostly 4KB chunks.
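You can double-check that reading by summing up the buckets from the log line (a small sketch):

# Each "count*sizekB" term is a bucket of free blocks of that size; the total
# should come out to the reported 8056kB.
line = "830*4kB 80*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB"
total = sum(int(count) * int(size.rstrip("kB"))
            for count, size in (term.split("*") for term in line.split()))
print(total, "kB")  # 8056 kB -- and all but one free block is 8kB or smaller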

So let's recapitulate:

  • you have a userland process (clamd) getting memory from ZONE_NORMAL, whereas non-privileged memory allocation would usually be performed from ZONE_HIGHMEM
  • the memory manager should at this stage have been able to serve the requested 4K page, although you seem to have significant memory pressure in ZONE_NORMAL
  • by kswapd's rules, the system should have seen some paging activity beforehand, yet nothing is being swapped out, even under memory pressure in ZONE_NORMAL, with no apparent cause
  • none of the above gives a definite reason as to why the oom-killer has been invoked

All of this seems rather odd, but it at least seems related to what is described in section 2.5 of Mel Gorman's excellent "Understanding the Linux Virtual Memory Manager" book:

As the address space usable by the kernel (ZONE_NORMAL) is limited in size, the kernel has support for the concept of High Memory. [...] To access memory between the range of 1GiB and 4GiB, the kernel temporarily maps pages from high memory into ZONE_NORMAL with kmap(). [...]

That means that to describe 1GiB of memory, approximately 11MiB of kernel memory is required. Thus, with 16GiB, 176MiB of memory is consumed, putting significant pressure on ZONE_NORMAL. This does not sound too bad until other structures are taken into account which use ZONE_NORMAL. Even very small structures such as Page Table Entries (PTEs) require about 16MiB in the worst case. This makes 16GiB about the practical limit for available physical memory for Linux on an x86.

(emphasis is mine)
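For context, the ~11MiB figure follows from the size of the kernel's mem_map array; here is a rough back-of-the-envelope check (assuming struct page weighs roughly 44 bytes on that kernel generation, which is my assumption rather than a quote from the book):

# How much ZONE_NORMAL memory is eaten just by describing RAM with struct page.
PAGE_SIZE = 4096          # bytes
STRUCT_PAGE = 44          # approximate size in bytes; varies by kernel version/config

pages_per_gib = (1 << 30) // PAGE_SIZE             # 262144 pages per GiB
mem_map_per_gib = pages_per_gib * STRUCT_PAGE      # bytes of struct page per GiB of RAM
print(round(mem_map_per_gib / 2**20), "MiB per GiB")          # ~11 MiB
print(round(16 * mem_map_per_gib / 2**20), "MiB for 16 GiB")  # ~176 MiB of ZONE_NORMAL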

Since kernel 3.2 has numerous advancements in memory management over 2.6, this is not a definite answer, but a really strong hint I would pursue first. Reduce the host's usable memory to at most 16G, either by using the mem= kernel parameter or by ripping half of the DIMMs out of the server.

Ultimately, use a 64-bit Kernel.

Dude, it's 2015.


A few things ...

My rule of thumb for swap space has been to have at least 2x the amount of physical ram. This gives the page/swap daemon room to reorganize memory efficiently.

Server_B has 32GB of ram, so try configuring it for 64GB of swap. IMO, the 2GB of swap space your server has is way too low, especially for a server.

If you don't have an extra partition that you can make into a swap partition, you can test this by creating a file and enabling it as swap with mkswap/swapon [it will be slower than a dedicated partition]. See https://www.maketecheasier.com/swap-partitions-on-linux/
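For completeness, a rough sketch of that swap-to-file test (run as root; the path and size are placeholders, and the underlying tools are the standard dd, mkswap and swapon):

# Create and enable a 16GB swap file for testing purposes.
import subprocess

SWAPFILE = "/swapfile_test"   # hypothetical path on a filesystem with enough room
SIZE_MB = 16 * 1024

subprocess.run(["dd", "if=/dev/zero", f"of={SWAPFILE}", "bs=1M", f"count={SIZE_MB}"], check=True)
subprocess.run(["chmod", "600", SWAPFILE], check=True)   # swapon complains about loose permissions
subprocess.run(["mkswap", SWAPFILE], check=True)
subprocess.run(["swapon", SWAPFILE], check=True)
# When done testing: swapoff /swapfile_test && rm /swapfile_test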

Since server_B has plenty of disk space, --inplace is not needed, and may be undesirable, as it may be what is causing rsync to use 32GB. --inplace is only really useful if you're short on filesystem space [which you aren't] or have some special performance requirement.

My guess is that rsync will want to use 50GB of ram [the file size] with your current options. Normally, rsync doesn't need that much memory to do its job, so one or more of your options may be the problem. I routinely transfer 200GB files without a problem.

Do some test runs using no options. Do this with smaller files, say 10GB--this should prevent the kernel panic, but still allow you to monitor the behavior that is causing the problem. Monitor rsync's memory usage.

Gradually, add back options, one at a time, to see which option [or combination of options] causes rsync to start pigging out on RAM (e.g. while the transfer is happening, rsync's ram usage grows proportionately to the amount of file data transferred, etc.).

If you truly need the options that cause rsync to keep some in-ram file image, you'll need the extra swap space and your maximum file size will be limited accordingly.

A few more things [UPDATED]:

(1) The kernel stack traceback shows that rsync was page faulting in an mmap'ed area, so it is probably mmap'ing the file. Unlike read/write, which goes to the FS block cache right away [where it will be flushed], mmap offers no guarantee that anything is flushed to disk until the file is closed.

(2) The kernel crash/panic is occurring when the transfer size hits the size of RAM. Clearly rsync is grabbing that much non-fscache memory via malloc or mmap. Once again, with the options you've specified, rsync will allocate 50GB of memory to transfer a 50GB file.

(3) Transfer a 24GB file. That will probably work. Then, boot the kernel with mem=16G and do the 24GB file test again. It will blow out at 16GB rather than 32GB. This will confirm that rsync really needs the memory.

(4) Before you say that adding swap is ridiculous, try adding some [via the swap-to-file method]. This is far easier to do and test than all the academic arguments about how swap isn't required. Even if it's not the solution, you may learn something from it. I'll bet that mem=16G test will succeed without a panic/crash.

(5) Chances are that rsync is hitting swap, but it happens too fast to see with top before OOM kicks in and kills rsync. By the time rsync gets to 32GB, other processes have already been forced out to swap, particularly if they're idle. Perhaps, a combination of "free" and "top" will give you a better picture.

(6) After rsync gets killed, it takes time to flush mmap to the FS. Not fast enough for OOM and it starts killing other things [some are obviously mission critical]. That is, the mmap flush and OOM are racing. Or, OOM has a bug. Otherwise, there wouldn't be a crash.

(7) In my experience, once a system "hits the memory wall", Linux takes a long time to fully recover. And, sometimes it never really recovers properly and the only way to clear it is a reboot. For example, I have 12GB of RAM. When I run a job that uses 40GB of memory [I have 120GB of swap to accommodate large jobs] and then kill it, it takes about 10 minutes for the system to return to normal responsiveness [with the disk light on solid all the while].

(8) Run rsync without options. This will work. Get a baseline example to work from. Then add back --inplace and retest. Then do --append-verify instead. Then, try both. Find out which option gets rsync doing the huge mmap. Then decide if you can live without it. If --inplace is the culprit, that's a no-brainer, since you've got plenty of disk space. If you must have the option, you'll have to get the swap space to accommodate the malloc/mmap that rsync will do.

SECOND UPDATE:

Please do the mem= and smaller file tests from the above.

The central questions: Why does rsync get killed by OOM? Who/What is chewing memory?

I read [but forgot] about the system being 32 bit. So, I agree, rsync may not be directly responsible (via malloc/mmap--glibc implements large mallocs via anonymous/private mmaps), and rsync's mmap page fault just triggers OOM by coincidence. Then, OOM calculates total memory consumed by rsync directly and indirectly [FS cache, socket buffers, etc.] and decides it's the prime candidate. So, monitoring total memory usage may be helpful. I suspect it creeps up at the same rate as the file transfer. Obviously, it shouldn't.

Some things you can monitor in /proc or /proc/rsync_pid via a perl or python script in a fast loop [a bash script probably won't be fast enough for the end-of-the-world event], sampling all of the following several hundred times per second. Run it at a higher priority than rsync so it keeps itself in RAM and running; that way you can watch things just before the crash and, hopefully, during OOM, so you can see why OOM goes crazy (a rough sketch follows the items below):

/proc/meminfo -- to get more fine grain on the swap usage at "point of impact". Actually, getting the final number on how much RAM is being used total may be more useful. While top provides this, it may not be fast enough to show the state of the universe just prior to the "big bang" (e.g. the last 10 milliseconds)

/proc/rsync_pid/fd directory. Reading the symlinks will allow you to identify which fd is opened on the target file (e.g. readlink of /proc/rsync_pid/fd/5 --> target_file). This probably only needs to be done once to get the fd number [it should stay fixed]
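For example, a quick listing of rsync's open fds and their targets (a tiny sketch; the PID is a placeholder):

# Print every open fd of the rsync process and the path it points at.
import os

rsync_pid = 12345   # placeholder: substitute rsync's real PID
fd_dir = f"/proc/{rsync_pid}/fd"
for fd in sorted(os.listdir(fd_dir), key=int):
    print(fd, "->", os.readlink(os.path.join(fd_dir, fd)))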

Knowing the fd number, look at /proc/rsync_pid/fdinfo/<fd_number>. It's a text file that looks like:

pos: <file_position>
flags: blah_blah
mnt_id: blah_blah

Monitoring the "pos" value can be helpful as the "last file position" may be useful. If you do several tests with varying sizes and mem= options, does the last file position track any of these [and how]? The usual suspect: file position == available RAM

But, the simplest way is to start with "rsync local_file server:remote_file" and verify that works. You may be able to get similar [but faster] results by doing "ssh server rsync file_a file_b" [you'd need to create a 50GB file_a first]. A simple way to create file_a is scp local_system:original_file server:file_a and this might be interesting unto itself (e.g. does this work when the rsync crashes? If scp works, but rsync fails, this points to rsync. If scp fails, this points to something else like the NIC driver). Doing the ssh rsync also takes the NIC out of the equation, which may be helpful. If that hoses the system, then something is really wrong. If it succeeds, [as I've mentioned] start adding back the options one-by-one.

I hate to belabor the point, but adding some swap via swap-to-file may change/delay the crash behavior and may be useful as a diagnostic tool. If adding, say 16GB, of swap delays the crash [as measured by memory usage or target file position] from 32GB to 46GB, then that will say something.

It may not be any particular process, but an errant kernel driver that is chewing memory. The kernel allocates memory internally via vmalloc, and IIRC such allocations are not bound by low-memory addressability under all circumstances.

Clearly, the OOM is getting confused/panicked. That is, it kills rsync, but doesn't see the memory freed up in a timely manner and goes looking for other victims. Some of them are probably critical to system operation.

malloc/mmap aside, this could be caused by unflushed FS cache that takes a long time (e.g. with 30GB of unflushed data, assuming a disk rate of 300 MB/sec, it might take 100 seconds to flush it). Even at that rate, OOM may be too impatient. Or, OOM killing rsync doesn't start the FS flush fast enough [or at all]. Or FS flush happens fast enough, but it has a "lazy" release of the pages back to the free pool. There are some /proc options you can set to control FS cache behavior [I can't remember what they are].
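These are most likely the vm.dirty_* sysctls; a read-only sketch to dump their current values:

# Print the writeback tunables that control how much dirty FS-cache data may
# accumulate and how quickly it gets flushed back to disk.
import glob

for path in sorted(glob.glob("/proc/sys/vm/dirty_*")):
    with open(path) as f:
        print(path, "=", f.read().strip())
# e.g. dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, ...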

Try booting with mem=4G or some other small number. This might cut back on the FS cache and shorten its flush time to keep OOM from looking for other things to kill (e.g. flush time is reduced from 100 sec to < 1 sec). It might also unmask an OOM bug that can't handle physical ram > 4GB on a 32 bit system or some such.

Also, an important point: Run as non-root. Root is never expected to chew up resources, so it is given more forgiving limits (e.g. 99% of memory vs 95% for non-root users). This may explain why OOM is in such a state. Also, this gives OOM et al. more headroom to do its job of reclaiming memory.


clamd? It sounds like you're using ClamAV and have on-access scanning enabled, where the anti-virus engine attempts to scan opened files for viruses by loading the entire contents of every file opened by any other process into memory.

Depending on your security posture and the necessity of this transfer, you should evaluate disabling ClamAV on-access scanning while you perform the transfer.