Invisible memory leak on Linux - Ubuntu Server (not disk cache/buffers!)
Solution 1:
My conclusion is that the leak is somewhere in the Linux kernel itself, which is why none of the userspace tools can show where the memory is going. It may be related to this question: https://serverfault.com/questions/670423/linux-memory-usage-higher-than-sum-of-processes
I upgraded the kernel from 3.13 to 3.19, and the memory leak seems to have stopped. I will report back if I see a leak again.
It would still be useful to have an easier way to see how much memory is used by the different parts of the Linux kernel. It is still a mystery what was causing the leak in 3.13.
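One partial way to do that, offered here only as a sketch and not verified against the 3.13 leak, is to snapshot the kernel-side counters in /proc/meminfo and the slab caches over time; memory used there never shows up in per-process tools:
# Kernel-side memory counters, in kB
grep -E '^(Slab|SReclaimable|SUnreclaim|VmallocUsed|KernelStack|PageTables)' /proc/meminfo
# Top slab caches, printed once; a cache that grows steadily between snapshots hints at the leaking subsystem
sudo slabtop -o | head -n 20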
Solution 2:
Story
I can reproduce your issue using ZFS on Linux.
Here is a server called node51 with 20GB of RAM. I marked 16GiB of RAM to be allocatable to the ZFS adaptive replacement cache (ARC):
root@node51 [~]# echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
root@node51 [~]# grep c_max /proc/spl/kstat/zfs/arcstats
c_max 4 17179869184
Then, I read a 45GiB file using Pipe Viewer in my ZFS pool zeltik to fill up the ARC:
root@node51 [~]# pv /zeltik/backup-backups/2014.04.11.squashfs > /dev/zero
45GB 0:01:20 [ 575MB/s] [==================================>] 100%
Now look at the free memory:
root@node51 [~]# free -m
             total       used       free     shared    buffers     cached
Mem:         20013      19810        203          1         51         69
-/+ buffers/cache:      19688        324
Swap:         7557          0       7556
Look!
51MiB in buffers
69MiB in cache
120MiB in both
19810MiB of RAM in use, including buffers and cache
19688MiB of RAM in use, excluding buffers and cache (the "-/+ buffers/cache" row)
The Python script that you referenced reports that applications are only using a small amount of RAM:
root@node51 [~]# python ps_mem.py
Private + Shared = RAM used Program
148.0 KiB + 54.0 KiB = 202.0 KiB acpid
176.0 KiB + 47.0 KiB = 223.0 KiB swapspace
184.0 KiB + 51.0 KiB = 235.0 KiB atd
220.0 KiB + 57.0 KiB = 277.0 KiB rpc.idmapd
304.0 KiB + 62.0 KiB = 366.0 KiB irqbalance
312.0 KiB + 64.0 KiB = 376.0 KiB sftp-server
308.0 KiB + 89.0 KiB = 397.0 KiB rpcbind
300.0 KiB + 104.5 KiB = 404.5 KiB cron
368.0 KiB + 99.0 KiB = 467.0 KiB upstart-socket-bridge
560.0 KiB + 180.0 KiB = 740.0 KiB systemd-logind
724.0 KiB + 93.0 KiB = 817.0 KiB dbus-daemon
720.0 KiB + 136.0 KiB = 856.0 KiB systemd-udevd
912.0 KiB + 118.5 KiB = 1.0 MiB upstart-udev-bridge
920.0 KiB + 180.0 KiB = 1.1 MiB rpc.statd (2)
1.0 MiB + 129.5 KiB = 1.1 MiB screen
1.1 MiB + 84.5 KiB = 1.2 MiB upstart-file-bridge
960.0 KiB + 452.0 KiB = 1.4 MiB getty (6)
1.6 MiB + 143.0 KiB = 1.7 MiB init
5.1 MiB + 1.5 MiB = 6.5 MiB bash (3)
5.7 MiB + 5.2 MiB = 10.9 MiB sshd (8)
11.7 MiB + 322.0 KiB = 12.0 MiB glusterd
27.3 MiB + 99.0 KiB = 27.4 MiB rsyslogd
67.4 MiB + 453.0 KiB = 67.8 MiB glusterfsd (2)
---------------------------------
137.4 MiB
=================================
19688MiB - 137.4MiB ≈ 19551MiB of unaccounted RAM
Explanation
The 120MiB of buffers and cache that you saw in the story above accounts for the kernel's efficient behavior of caching data sent to or received from an external device.
The first row, labeled Mem, displays physical memory utilization, including the amount of memory allocated to buffers and caches. A buffer, also called buffer memory, is usually defined as a portion of memory that is set aside as a temporary holding place for data that is being sent to or received from an external device, such as an HDD, keyboard, printer, or network.
The second line of data, which begins with -/+ buffers/cache, shows the amount of physical memory currently devoted to the system buffer cache. This is particularly meaningful for application programs, because all data accessed from files on the system through read() and write() system calls passes through this cache. The cache can greatly speed up access to data by reducing or eliminating the need to read from or write to the HDD or other disk.
Source: http://www.linfo.org/free.html
Now how do we account for the missing 19551MiB?
In the free -m output above, the 19688MiB "used" in "-/+ buffers/cache" comes from this formula:
(kb_main_used) - (buffers_plus_cached) =
(kb_main_total - kb_main_free) - (kb_main_buffers + kb_main_cached)
kb_main_total: MemTotal from /proc/meminfo
kb_main_free: MemFree from /proc/meminfo
kb_main_buffers: Buffers from /proc/meminfo
kb_main_cached: Cached from /proc/meminfo
Source: procps/free.c and procps/proc/sysinfo.c
(If you do the numbers based on my free -m output, you'll notice that 2MiB aren't accounted for, but that's because of rounding errors introduced by this code: #define S(X) ( ((unsigned long long)(X) << 10) >> shift))
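As a sanity check (my own sketch, not code from procps), you can recompute that "used" figure directly from /proc/meminfo:
# Recompute "used, excluding buffers/cache" from /proc/meminfo (values there are in kB)
awk '/^MemTotal:/{t=$2} /^MemFree:/{f=$2} /^Buffers:/{b=$2} /^Cached:/{c=$2} END{printf "used excluding buffers/cache: %d MiB\n", (t-f-b-c)/1024}' /proc/meminfo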
The numbers don't add up in /proc/meminfo, either (I didn't record /proc/meminfo when I ran free -m, but we can see from your question that /proc/meminfo doesn't show where the missing RAM is), so we can conclude from the above that /proc/meminfo doesn't tell the whole story.
In my testing conditions, I know as a control that ZFS on Linux is responsible for the high RAM usage. I told its ARC that it could use up to 16GiB of the server's RAM.
ZFS on Linux isn't a process. It's a kernel module.
From what I've found so far, the RAM usage of a kernel module wouldn't show up using process information tools because the module isn't a process.
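If ZFS is the suspect on your system too, the ARC at least reports its own size through the same kstat interface used above. This check is an illustration I am adding, not something from the original question; note that ARC memory typically does not appear under Buffers or Cached in /proc/meminfo, which is exactly why it looks invisible to free:
# Current ARC size, in bytes
grep '^size ' /proc/spl/kstat/zfs/arcstats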
Troubleshooting
Unfortunately, I don't know enough about Linux to offer you a way to build a list of how much RAM non-process components (like the kernel and its modules) are using.
At this point, we can speculate, guess, and check.
You provided a dmesg output. Well-designed kernel modules would log some of their details to dmesg.
After looking through dmesg, one item stood out to me: FS-Cache.
FS-Cache relates to the cachefiles kernel module and to the cachefilesd package on Debian and Red Hat Enterprise Linux.
Perhaps some time ago, you configured FS-Cache on a RAM disk to reduce the impact of network I/O as your server analyzes the video data.
Try disabling any suspicious kernel modules that could be eating up RAM. They can probably be disabled with a blacklist entry in /etc/modprobe.d/, followed by sudo update-initramfs -u (commands and locations may vary by Linux distribution); a sketch follows.
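For example, assuming the suspect is the cachefiles module mentioned above (substitute whatever module you actually suspect):
# Prevent the module from loading at boot
echo "blacklist cachefiles" | sudo tee /etc/modprobe.d/blacklist-cachefiles.conf
sudo update-initramfs -u
# After a reboot, confirm that it is no longer loaded
lsmod | grep cachefiles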
Conclusion
A memory leak is eating up 8MB/hr of your RAM and won't release the RAM, seemingly no matter what you do. I was not able to determine the source of your memory leak based on the information that you provided, nor was I able to offer a way to find that memory leak.
Someone who is more experienced with Linux than I will need to provide input on how we can determine where the "other" RAM usage is going.
I have started a bounty on this question to see if we can get a better answer than "speculate, guess, and check".
Solution 3:
Did you change the swappiness of your kernel manually, or disable it?
You can check your current swappiness level with:
cat /proc/sys/vm/swappiness
You could try forcing your kernel to swap aggressively with:
sudo sysctl -w vm.swappiness=100
If this reduces your problem, find a good value between 1 and 100 that fits your requirements.
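To keep a chosen value across reboots, you can persist it through the standard sysctl configuration mechanism (not specific to this problem; 60 below is just a placeholder value):
echo "vm.swappiness=60" | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system   # reload all sysctl configuration files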