How to instruct Linux not to swap out hot pages of mmapped files?
I have a server where I run worker processes that mmap several rather big read-only "dictionary" files (~8GB total). Tests showed that they actively access only about 1GB worth of the pages in these files. On the same server I run another process that sequentially reads a huge file, merges some updates into it and writes the result to a new version of this huge file. There is no other major activity on the server besides this "merger" process and the "worker" processes. So normally the workers should be CPU-bound and the merger should be disk-bound. But what I see is that the workers are constantly choking on major page faults. The merger uses around 20GB of RSS and the machine has 48GB. There are 4 workers. They have 2GB RSS each and only 600MB shared (instead of the expected 1GB of hot pages). Somehow the rest of the memory is mostly used by the fs cache. Is there a way to "prioritize" the hot pages of my mmapped files so that they stay in memory? I tried madvise(MADV_WILLNEED) but it doesn't seem to help. Maybe there is a solution with cgroups or sysctls?
$ free
             total       used       free     shared    buffers     cached
Mem:      49324064   48863392     460672          0      22520   25409896
-/+ buffers/cache:   23430976   25893088
Swap:            0          0          0
$ uname -a
Linux dev-kiwi02 3.2.0-25-server #40-Ubuntu SMP Fri May 25 13:12:35 UTC 2012 x86_64 GNU/Linux
P.S. Asked this on StackOverflow already but looks like ServerFault is more appropriate.
What you probably need is mlock(), not madvise(). madvise() is too 'weak': it is only a hint that the kernel is free to ignore, whereas mlock() pins the pages in RAM so the kernel cannot evict them. Assuming you have enough RAM and lock only the 'hot' pages (not the whole 8GB), that shouldn't be a problem for your setup.
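To make the mlock() suggestion concrete, here is a minimal sketch of how a worker could pin one hot region of a mapped dictionary. The helper names (page_align, map_dictionary, lock_hot_range) and the idea that you already know the offset and length of the hot region are assumptions for illustration, not something taken from the question. Note that mlock() is limited by RLIMIT_MEMLOCK (ulimit -l) unless the process has CAP_IPC_LOCK, so the limit probably has to be raised for the workers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: widen [off, off+len) outward to page boundaries. */
void page_align(size_t off, size_t len, size_t page,
                size_t *start, size_t *alen)
{
    assert(page != 0);                 /* precondition */
    size_t end = off + len;
    size_t rem = end % page;
    *start = off - (off % page);       /* round start down */
    if (rem != 0)
        end += page - rem;             /* round end up */
    *alen = end - *start;
}

/* Hypothetical helper: map a file read-only; returns MAP_FAILED on error. */
void *map_dictionary(const char *path, size_t *size)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return MAP_FAILED;
    }
    *size = (size_t)st.st_size;
    void *base = mmap(NULL, *size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping stays valid after close */
    return base;
}

/* Pin the pages covering [off, off+len) of the mapping in RAM.
 * Returns 0 on success, -1 with errno set (EINVAL for a bad range,
 * otherwise whatever mlock() reports, e.g. ENOMEM or EPERM). */
int lock_hot_range(void *base, size_t map_size, size_t off, size_t len)
{
    size_t start, alen;
    if (len == 0 || off >= map_size || len > map_size - off) {
        errno = EINVAL;
        return -1;
    }
    page_align(off, len, (size_t)sysconf(_SC_PAGESIZE), &start, &alen);
    return mlock((char *)base + start, alen);
}
```

A worker would call map_dictionary() once per dictionary file and then lock_hot_range() for each region it knows to be hot; mlock() faults the pages in and keeps them resident until munlock() or munmap(). Since all 4 workers map the same files with MAP_SHARED, the locked pages are shared between them rather than duplicated.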
Another solution that may seem counterintuitive: disable swap. Your machine has 48GB; subtract the 4 workers, the shared data and the OS and you still have more than 35GB left. You write that your merger reads a file sequentially and merges in some updates; I therefore assume you don't need to keep the big file in memory but can write it out sequentially as well. You only need to hold the updates in memory, which shouldn't be a problem.