What goes on behind the curtains during disk I/O?

Solution 1:

Indeed, at least on my system with GNU libc, it looks like stdio is reading 4kB blocks before writing back the changed portion. Seems bogus to me, but I imagine somebody thought it was a good idea at the time.

I checked by writing a trivial C program to open a file, write a small amount of data once, and exit; then I ran it under strace to see which syscalls it actually triggered. Writing at an offset of 10000, I saw these syscalls:

lseek(3, 8192, SEEK_SET)                = 8192
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1808) = 1808
write(3, "hello", 5)                    = 5
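
A minimal sketch of the kind of test program described above might look like this. It is a reconstruction, not the original program: the helper name stdio_write_at and the "r+b" open mode are assumptions, and the file is assumed to already exist and to be larger than the offset.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): write the string
 * `data` at byte `offset` of an existing file using stdio.
 * Returns 0 on success, -1 on failure. */
int stdio_write_at(const char *path, long offset, const char *data)
{
    /* "r+b" opens for update without truncating; the file must exist. */
    FILE *fp = fopen(path, "r+b");
    if (fp == NULL)
        return -1;

    /* Position the stream; stdio may fill its buffer around this offset. */
    if (fseek(fp, offset, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }

    size_t len = strlen(data);
    size_t written = fwrite(data, 1, len, fp);

    /* fclose flushes the stdio buffer, so the data reaches the file here
     * at the latest. */
    if (fclose(fp) != 0 || written != len)
        return -1;
    return 0;
}
```

Calling stdio_write_at("testfile", 10000, "hello") from main and running the binary under strace should show which syscalls stdio issues on your system. The exact sequence depends on the libc version and its buffer size.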

Seems that you'll want to stick with the low-level Unix-style I/O for this project, eh?

Solution 2:

The C standard library functions perform additional buffering, and are generally optimized for streaming reads rather than random I/O. On my system, I don't observe the spurious reads that Jamey Sharp saw in every case; I only see them when the offset is not aligned to the page size. It could be that the C library always tries to keep its I/O buffer aligned to 4 kB or something.

In your case, if you're doing lots of random reads and writes across a reasonably small dataset, you'd likely be best served using pread/pwrite to avoid having to make separate seek syscalls, or simply mmapping the dataset and writing to it in memory (likely to be the fastest, if your dataset fits in memory).
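
As a rough illustration of the two options, here is a minimal sketch using POSIX calls. The helper names pwrite_at and mmap_write_at are made up for this example, error handling is kept to the bare minimum, and the mmap variant assumes the whole file fits comfortably in the address space.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: positional write. pwrite takes the offset directly,
 * so there is no lseek call and no stdio buffer involved.
 * Returns 0 on success, -1 on failure. */
int pwrite_at(const char *path, off_t offset, const void *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, data, len, offset);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Hypothetical helper: map the whole file and patch it in memory.
 * The write must lie inside the existing file; mmap does not grow it.
 * Returns 0 on success, -1 on failure. */
int mmap_write_at(const char *path, off_t offset, const void *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || offset < 0 ||
        offset + (off_t)len > st.st_size) {
        close(fd);
        return -1;
    }

    size_t size = (size_t)st.st_size;
    char *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (base == MAP_FAILED)
        return -1;

    memcpy(base + offset, data, len);

    /* MS_SYNC blocks until the dirty pages are written back; a real
     * program doing many updates would likely sync far less often. */
    int rc = msync(base, size, MS_SYNC);
    if (munmap(base, size) != 0)
        rc = -1;
    return rc;
}
```

In a real program you would keep the file descriptor or mapping open across many updates rather than reopening per write; the open/close in each helper is only there to keep the sketch self-contained.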