Dirty page accounting in Linux kernel through /proc/$PID/smaps

Solution 1:

Now consider the statement

static char page1[PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE))) = {'c'};

Here, the loader will load the values for page1 at init of the program, and mark the page as RW.

You seem to believe that the loader does a write to memory for this statement, but it does not.

What happens in this case is not mmap RW + write of the byte 'c'. That byte is already embedded in your executable at compile time, so the only thing that happens is a mmap RW, nothing more. Something like this:

mmap(0, size_of_data_section, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_of_your_elf, offset_of_data_section);

Or, most likely, just mmap(...entire file...) followed by a series of mprotect() with the right permissions for the different sections of the ELF.
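For illustration, here is a rough sketch (mine, not part of the original explanation) of what that map-then-mprotect sequence could look like. The helper name, offsets and sizes are made-up placeholders, assumed to be page-aligned; a real loader derives them from the ELF program headers:

/* Hypothetical sketch: map the whole file privately, then fix up
 * permissions per region. All names, offsets and sizes are placeholders
 * and are assumed to be page-aligned. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void *map_elf_sketch(const char *path, size_t file_size,
                     size_t text_off, size_t text_len,
                     size_t data_off, size_t data_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    /* Map the entire file privately, with no access to begin with. */
    void *base = mmap(NULL, file_size, PROT_NONE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* Code region: read + execute. */
    mprotect((char *)base + text_off, text_len, PROT_READ | PROT_EXEC);

    /* Data region (this is where page1 would live): read + write.
     * Writes stay private to the process because the mapping is MAP_PRIVATE,
     * so PROT_WRITE is allowed even though the file was opened read-only. */
    mprotect((char *)base + data_off, data_len, PROT_READ | PROT_WRITE);

    return base;
}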

Actually, it is not even the loader that does this: it is the kernel itself that maps the executable into memory in this case, assuming you are launching your program as ./exe. The loader maps the program itself only when it is invoked explicitly, as in /path/to/loader ./exe. See also this other answer of mine, where I give a slightly more detailed explanation.


How on earth did the kernel get to know that I wrote the page?

As you probably already know, when your program is initially mapped in memory (including the page containing page1), even though the mapping for that page is RW, there is no real need for the kernel to actually allocate memory for the page until a read or write occurs. This technique is known as demand paging. Initially (right after it is mapped) the page is not even present in the page table of your process: it only exists as one of the many vm_area_struct entries in the memory map of your task.
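As a quick illustration of demand paging (a sketch of mine, not from the original answer), the program below uses an anonymous MAP_PRIVATE mapping just to keep things short; the same principle applies to the file-backed mapping that holds page1. mincore(2) reports the page as not resident until the first access faults it in:

/* Demand paging in a nutshell: the page behind a fresh mapping is not
 * resident until it is first touched. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    unsigned char vec;
    mincore(p, page, &vec);
    printf("before first access: resident = %d\n", vec & 1); /* expect 0 */

    p[0] = 'c'; /* first access: page fault, the kernel now allocates a frame */

    mincore(p, page, &vec);
    printf("after first access:  resident = %d\n", vec & 1); /* expect 1 */

    munmap(p, page);
    return 0;
}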

When a page fault occurs (caused by a read or a write) the kernel then decides what to do based on the nature of the mapping. In this case the mapping is file-backed (the actual initial value for the whole page1 array was written in your ELF file at compile time), so the two possible scenarios are as follows:

  1. When a memory read happens, a page fault occurs and the page content is read from the file into memory. The newly allocated page is marked as read-only, even though it was mapped as RW (the kernel still knows that this VMA is RW).

  2. When a memory write happens, there are two cases: either (A) the page was already present in memory because of a previous read (and is marked RO), or (B) the page wasn't in memory at all because this is the first memory access to it. In both cases, a page fault happens, the kernel checks if writing is allowed (yes it is), and copy-on-write takes place.

    Since the page was file-backed, but not shared (i.e. not mapped with MAP_SHARED), the data does not need to be written back to the file, so the kernel simply allocates a new anonymous page and either copies the content over from the previous page (case A) or reads the page from the file into memory (case B) before applying the write. This is why you see Anonymous go from 0kB to 4kB.

    Additionally, since the old page only had one user before copy-on-write, it can be deallocated once the new anonymous page replaces it in your page table: one page leaves your Rss as another enters it, and this is why you see Rss stay the same. Finally, before the write the page was clean (not dirty), and after the write it is dirty, which is why you see Private_Clean decrease and Private_Dirty increase by the same amount. A small sketch of how to watch these counters change follows below.
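If you want to see this happen, here is a rough sketch (mine, not part of the original answer) that reads and then writes page1 while printing a few smaps fields for the mapping that contains it. The field names are the real ones from /proc/self/smaps; PAGE_SIZE is assumed to be 4096, the counters cover the whole mapping rather than just that one page, and the parsing is deliberately simplistic, so exact values will depend on your system:

/* Watch Rss / Private_Clean / Private_Dirty / Anonymous change for the
 * mapping that contains page1 across a read and then a write. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static char page1[PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE))) = {'c'};

static void dump_smaps_entry(const void *addr, const char *when)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f)
        return;

    char line[512];
    int inside = 0;

    printf("--- %s ---\n", when);
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;

        /* Mapping headers look like "start-end perms offset dev inode path". */
        if (sscanf(line, "%lx-%lx ", &start, &end) == 2) {
            inside = (uintptr_t)addr >= start && (uintptr_t)addr < end;
            continue;
        }
        if (inside && (strncmp(line, "Rss:", 4) == 0 ||
                       strncmp(line, "Private_Clean:", 14) == 0 ||
                       strncmp(line, "Private_Dirty:", 14) == 0 ||
                       strncmp(line, "Anonymous:", 10) == 0))
            fputs(line, stdout);
    }
    fclose(f);
}

int main(void)
{
    dump_smaps_entry(page1, "before touching page1");

    volatile char c = page1[0];   /* read fault: the page comes in, still clean */
    (void)c;
    dump_smaps_entry(page1, "after reading page1");

    page1[0] = 'x';               /* write fault: copy-on-write, now anonymous and dirty */
    dump_smaps_entry(page1, "after writing page1");

    return 0;
}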