mmap, msync and linux process termination
I want to use mmap to implement persistence of certain portions of program state in a C program running under Linux by associating a fixed-size struct with a well known file name using mmap() with the MAP_SHARED flag set. For performance reasons, I would prefer not to call msync() at all, and no other programs will be accessing this file. When my program terminates and is restarted, it will map the same file again and do some processing on it to recover the state that it was in before the termination. My question is this: if I never call msync() on the file descriptor, will the kernel guarantee that all updates to the memory will get written to disk and be subsequently recoverable even if my process is terminated with SIGKILL? Also, will there be general system overhead from the kernel periodically writing the pages to disk even if my program never calls msync()?
EDIT: I've settled the problem of whether the data is written, but I'm still not sure about whether this will cause some unexpected system loading over trying to handle this problem with open()/write()/fsync() and taking the risk that some data might be lost if the process gets hit by KILL/SEGV/ABRT/etc. Added a 'linux-kernel' tag in hopes that some knowledgeable person might chime in.
Solution 1:
I found a comment from Linus Torvalds that answers this question http://www.realworldtech.com/forum/?threadid=113923&curpostid=114068
The mapped pages are part of the filesystem cache, which means that even if the user process that made a change to that page dies, the page is still managed by the kernel and as all concurrent accesses to that file will go through the kernel, other processes will get served from that cache.
In some old Linux kernels it was different, that's the reason why some kernel documents still tell to force msync
.
EDIT: Thanks RobH corrected the link.
EDIT:
A new flag, MAP_SYNC, is introduced since Linux 4.15, which can guarantee the coherence.
Shared file mappings with this flag provide the guarantee that while some memory is writably mapped in the address space of the process, it will be visible in the same file at the same offset even after the system crashes or is rebooted.
references:
http://man7.org/linux/man-pages/man2/mmap.2.html search MAP_SYNC in the page
https://lwn.net/Articles/731706/
Solution 2:
I decided to be less lazy and answer the question of whether the data is written to disk definitively by writing some code. The answer is that it will be written.
Here is a program that kills itself abruptly after writing some data to an mmap'd file:
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
typedef struct {
char data[100];
uint16_t count;
} state_data;
const char *test_data = "test";
int main(int argc, const char *argv[]) {
int fd = open("test.mm", O_RDWR|O_CREAT|O_TRUNC, (mode_t)0700);
if (fd < 0) {
perror("Unable to open file 'test.mm'");
exit(1);
}
size_t data_length = sizeof(state_data);
if (ftruncate(fd, data_length) < 0) {
perror("Unable to truncate file 'test.mm'");
exit(1);
}
state_data *data = (state_data *)mmap(NULL, data_length, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, fd, 0);
if (MAP_FAILED == data) {
perror("Unable to mmap file 'test.mm'");
close(fd);
exit(1);
}
memset(data, 0, data_length);
for (data->count = 0; data->count < 5; ++data->count) {
data->data[data->count] = test_data[data->count];
}
kill(getpid(), 9);
}
Here is a program that validates the resulting file after the previous program is dead:
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
typedef struct {
char data[100];
uint16_t count;
} state_data;
const char *test_data = "test";
int main(int argc, const char *argv[]) {
int fd = open("test.mm", O_RDONLY);
if (fd < 0) {
perror("Unable to open file 'test.mm'");
exit(1);
}
size_t data_length = sizeof(state_data);
state_data *data = (state_data *)mmap(NULL, data_length, PROT_READ, MAP_SHARED|MAP_POPULATE, fd, 0);
if (MAP_FAILED == data) {
perror("Unable to mmap file 'test.mm'");
close(fd);
exit(1);
}
assert(5 == data->count);
unsigned index;
for (index = 0; index < 4; ++index) {
assert(test_data[index] == data->data[index]);
}
printf("Validated\n");
}
Solution 3:
I found something adding to my confusion:
munmap does not affect the object that was mappedthat is, the call to munmap does not cause the contents of the mapped region to be written to the disk file. The updating of the disk file for a MAP_SHARED region happens automatically by the kernel's virtual memory algorithm as we store into the memory-mapped region.
this is excerpted from Advanced Programming in the UNIX® Environment.
from the linux manpage:
MAP_SHARED Share this mapping with all other processes that map this object. Storing to the region is equiva-lent to writing to the file. The file may not actually be updated until msync(2) or munmap(2) are called.
the two seem contradictory. is APUE wrong?
Solution 4:
I didnot find a very precise answer to your question so decided add one more:
- Firstly about losing data, using write or mmap/memcpy mechanisms both writes to page cache and are synced to underlying storage in background by OS based on its page replacement settings/algo. For example linux has vm.dirty_writeback_centisecs which determines which pages are considered "old" to be flushed to disk. Now even if your process dies after the write call has succeeded, the data would not be lost as the data is already present in kernel pages which will eventually be written to storage. The only case you would lose data is if OS itself crashes (kernel panic, power off etc). The way to absolutely make sure your data has reached storage would be call fsync or msync (for mmapped regions) as the case might be.
- About the system load concern, yes calling msync/fsync for each request is going to slow your throughput drastically, so do that only if you have to. Remember you are really protecting against losing data on OS crashes which I would assume is rare and probably something most could live with. One general optimization done is to issue sync at regular intervals say 1 sec to get a good balance.
Solution 5:
Either the Linux manpage information is incorrect or Linux is horribly non-conformant. msync
is not supposed to have anything to do with whether the changes are committed to the logical state of the file, or whether other processes using mmap
or read
to access the file see the changes; it's purely an analogue of fsync
and should be treated as a no-op except for the purposes of ensuring data integrity in the event of power failure or other hardware-level failure.