Why doesn't Python's mmap work with large files?
Solution 1:
From IEEE 1003.1:
The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.
It needs all the virtual address space because that's exactly what mmap()
does.
The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap()
didn't fail, you could still only read the first 4GB before you ran out of space in a 32-bit address space. You can, of course, mmap()
a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern such that you limit how many times you have to visit previous windows.
Solution 2:
Sorry to answer my own question, but I think the real problem I had was not realising that mmap was a standard POSIX system call with particular characterisatations and limitations and that the Python mmap is supposed just to expose its functionality.
The Python documentation doesn't mention the POSIX mmap and so if you come at it as a Python programmer without much knowledge of POSIX (as I did) then the address space problem appears quite arbitrary and badly designed!
Thanks to the other posters for teaching me the true meaning of mmap. Unfortunately no one has suggested a better alternative to my hand-crafted class for treating large files as strings, so I shall have to stick with it for now. Perhaps I will clean it up and make it part of my module's public interface when I get the chance.