How to instruct Linux not to swap out hot pages of mmapped files?

I have a server where I run worker processes that mmap several rather big read-only "dictionary" files (~8GB total). Tests showed that they actively access only around ~1GB worth of pages in these files. On the same server I run another process that sequentially reads a huge file, merges some updates into it, and writes the result out as a new version of this huge file. There is no other major activity on the server besides this "merger" process and the "worker" processes, so normally the workers should be CPU-bound and the merger should be disk-bound.

What I actually see is that the workers are constantly choking on major page faults. The merger uses around 20GB of RSS and the machine has 48GB. There are 4 workers; they have 2GB RSS each and only 600MB shared (instead of the expected 1GB of hot pages). The rest of the memory is mostly taken up by fs cache.

Is there a way to "prioritize" the hot pages of my mmapped files into memory? I tried madvise(MADV_WILLNEED), but it doesn't seem to help. Maybe there is a solution with cgroups or sysctls?
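For reference, this is roughly how each worker maps and advises a dictionary (simplified sketch: the path comes from elsewhere, and error handling is stripped down, so this is not the exact production code):

/* Illustrative only: map a read-only dictionary file and hint that it
   will be needed. Paths and sizes are placeholders. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_dictionary(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    /* Hint that we will need these pages soon; the kernel may prefetch
       them, but it is still free to evict them again under pressure. */
    madvise(p, st.st_size, MADV_WILLNEED);

    *len_out = st.st_size;
    return p;
}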

$ free
             total       used       free     shared    buffers     cached
Mem:      49324064   48863392     460672          0      22520   25409896
-/+ buffers/cache:   23430976   25893088
Swap:            0          0          0

$ uname -a
Linux dev-kiwi02 3.2.0-25-server #40-Ubuntu SMP Fri May 25 13:12:35 UTC 2012 x86_64 GNU/Linux

P.S. I already asked this on StackOverflow, but it looks like ServerFault is more appropriate.


What you probably need is mlock(), not madvise(). madvise() is only a hint that the kernel is free to ignore; mlock() actually pins the pages in RAM so they cannot be reclaimed. Assuming you have enough RAM and only lock the 'hot' pages (not the whole 8GB), that shouldn't be a problem for your setup.
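A minimal sketch of what that could look like, assuming you have already mmapped the file and know (or have measured) the offset and length of the hot range; those two values are placeholders here:

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pin a "hot" sub-range of an already-mmapped file in RAM so the kernel
   cannot reclaim it. hot_offset/hot_len are assumptions: you have to
   know, or measure, which part of each dictionary is actually hot. */
int pin_hot_range(void *map_base, size_t hot_offset, size_t hot_len)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)map_base + hot_offset;
    uintptr_t aligned = start & ~((uintptr_t)page - 1);   /* round down to page */

    /* mlock() faults the pages in and keeps them resident. It needs
       CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK (see ulimit -l). */
    if (mlock((void *)aligned, hot_len + (start - aligned)) != 0) {
        perror("mlock");
        return -1;
    }
    return 0;
}

Note that locked memory counts against RLIMIT_MEMLOCK, so you may need to raise ulimit -l (or grant CAP_IPC_LOCK) for the worker processes.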

Another solution that may seem counterintuitive: disable swap. Your machine has 48GB; subtract the 4 workers, the shared data and your OS, and you still have more than 35GB left. You write that your merger reads a file sequentially and inserts a few entries; therefore I assume you don't need to keep the big file in memory but can write it out sequentially as well. You only need to load all your updates into memory, which shouldn't be a problem.
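A rough sketch of that streaming approach, where apply_updates() is a stand-in for whatever in-memory update structure your merger keeps (not a real API), and the big file only ever passes through a small buffer:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* 1 MiB working buffer */

/* Hypothetical hook: patch the records in buf using the in-memory updates,
   returning the length of the patched data. */
extern size_t apply_updates(char *buf, size_t len);

int merge_stream(const char *in_path, const char *out_path)
{
    FILE *in = fopen(in_path, "rb");
    if (!in)
        return -1;
    FILE *out = fopen(out_path, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    char *buf = malloc(CHUNK);
    if (!buf) {
        fclose(in);
        fclose(out);
        return -1;
    }

    size_t n;
    while ((n = fread(buf, 1, CHUNK, in)) > 0) {
        size_t m = apply_updates(buf, n);   /* merge in place */
        if (fwrite(buf, 1, m, out) != m)
            break;
    }

    free(buf);
    fclose(in);
    fclose(out);
    return 0;
}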