What goes on behind the curtains during disk I/O?
Solution 1:
Indeed, at least on my system with GNU libc, it looks like stdio is reading from the preceding 4 kB boundary up to the write position before writing back the changed portion. Seems bogus to me, but I imagine somebody thought it was a good idea at the time.
I checked by writing a trivial C program to open a file, write a small amount of data once, and exit; then I ran it under strace to see which syscalls it actually triggered. Writing at an offset of 10000, I saw these syscalls:
lseek(3, 8192, SEEK_SET) = 8192
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1808) = 1808
write(3, "hello", 5) = 5
Seems that you'll want to stick with the low-level Unix-style I/O for this project, eh?
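For anyone who wants to repeat the experiment, here is a minimal sketch of the kind of test program described above. It is not the exact program I ran; the helper name stdio_write_at and the use of "r+b" mode are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write `data` at byte `offset` of an existing file
   using buffered stdio. Returns 0 on success, -1 on failure. */
int stdio_write_at(const char *path, long offset, const char *data)
{
    assert(path != NULL && data != NULL);

    FILE *f = fopen(path, "r+b");   /* read/write; the file must already exist */
    if (f == NULL)
        return -1;

    size_t len = strlen(data);
    int ok = fseek(f, offset, SEEK_SET) == 0
          && fwrite(data, 1, len, f) == len;

    /* fclose flushes the stdio buffer, so a small write normally
       reaches the kernel only at this point. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling stdio_write_at("test.dat", 10000, "hello") from main and running the binary under `strace -e trace=lseek,read,write` should show whether your libc issues the extra read. The result may well differ between libc versions and buffer sizes.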
Solution 2:
The C standard library functions perform additional buffering, and are generally optimized for streaming reads rather than random I/O. On my system, I don't observe the spurious reads that Jamey Sharp saw, except when the offset is not aligned to the page size. It could be that the C library always tries to keep its I/O buffer aligned to 4 kB or something.
In your case, if you're doing lots of random reads and writes across a reasonably small dataset, you'd likely be best served using pread/pwrite to avoid having to make seeking syscalls, or simply mmap-ing the dataset and writing to it in memory (likely to be the fastest, if your dataset fits in memory).
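As a rough sketch of both options, assuming a POSIX system: the helper names write_at, read_at and map_file are made up for illustration, and the loops simply retry on short transfers without any special handling of EINTR.

```c
#define _XOPEN_SOURCE 700   /* expose pread/pwrite under strict -std modes */
#include <assert.h>
#include <fcntl.h>          /* open() and O_RDWR, needed by callers */
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `len` bytes at `offset`. No lseek is needed and the
   file position of `fd` is left unchanged. Returns 0 or -1. */
int write_at(int fd, const void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}

/* Read exactly `len` bytes from `offset`. Returns 0, or -1 on error
   or if end of file is reached first. */
int read_at(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    char *p = buf;
    while (len > 0) {
        ssize_t n = pread(fd, p, len, offset);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}

/* Alternative: map the first `size` bytes of the file once, then update
   the dataset with plain memory writes. Returns NULL on failure. */
void *map_file(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

With a descriptor from open("data.bin", O_RDWR), a call like write_at(fd, "hello", 5, 10000) should show up in strace as a single pwrite syscall, with no preceding lseek or read. With map_file, writes go through the returned pointer; call msync if you need them on disk at a known point, and munmap when done.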