file.tell() inconsistency
Does anybody happen to know why when you iterate over a file this way:
Input:
f = open('test.txt', 'r')
for line in f:
    print "f.tell(): ", f.tell()
Output:
f.tell(): 8192
f.tell(): 8192
f.tell(): 8192
f.tell(): 8192
I consistently get the wrong file position from tell(). However, if I use readline() instead, tell() returns the correct position:
Input:
f = open('test.txt', 'r')
while True:
    line = f.readline()
    if line == '':
        break
    print "f.tell(): ", f.tell()
Output:
f.tell(): 103
f.tell(): 107
f.tell(): 115
f.tell(): 124
I'm running Python 2.7.1, BTW.
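Iterating over an open file uses a read-ahead buffer to increase efficiency. As a result, the file pointer advances in large steps across the file as you loop over the lines, and tell() reports the end of the block that has been read ahead rather than the end of the current line.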
From the File Objects documentation:
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
If you need to rely on .tell(), don't use the file object as an iterator. You can turn .readline() into an iterator instead (at the price of some performance loss):
for line in iter(f.readline, ''):
    print f.tell()
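This uses the sentinel argument of the iter() function to turn any callable into an iterator: the callable is invoked repeatedly until it returns the sentinel (here, the empty string that readline() returns at end of file).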
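If the performance loss matters, another option is to keep the fast file iterator and count bytes yourself. The following is only a minimal sketch: the function names and the sample data are made up for illustration, and it assumes the file is opened in binary mode so that len(line) matches the number of bytes consumed. It computes the offsets both ways so they can be compared.

```python
import os
import tempfile


def offsets_via_readline(path):
    """Offset after each line, using readline() together with tell()."""
    offsets = []
    with open(path, 'rb') as f:
        # iter(callable, sentinel) calls f.readline() until it returns b''.
        for line in iter(f.readline, b''):
            offsets.append(f.tell())
    return offsets


def offsets_via_counting(path):
    """Same offsets, but iterating over the file and summing line lengths."""
    offsets = []
    position = 0
    with open(path, 'rb') as f:
        for line in f:
            position += len(line)  # bytes consumed so far
            offsets.append(position)
    return offsets


# Hypothetical sample data written to a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as tmp:
    tmp.write(b'first\nsecond line\nthird\n')

readline_offsets = offsets_via_readline(path)
counted_offsets = offsets_via_counting(path)
os.remove(path)
```

Both lists should come out identical. The counting version never calls tell(), so it is unaffected by whatever buffering the iterator does.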
The answer lies in the following part of the Python 2.7 source code (fileobject.c):
#define READAHEAD_BUFSIZE 8192

static PyObject *
file_iternext(PyFileObject *f)
{
    PyStringObject* l;

    if (f->f_fp == NULL)
        return err_closed();
    if (!f->readable)
        return err_mode("reading");

    l = readahead_get_line_skip(f, 0, READAHEAD_BUFSIZE);
    if (l == NULL || PyString_GET_SIZE(l) == 0) {
        Py_XDECREF(l);
        return NULL;
    }
    return (PyObject *)l;
}
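As you can see, file's iterator interface reads the file in blocks of 8 KB (READAHEAD_BUFSIZE is 8192 bytes). This explains why f.tell() behaves the way it does: it reports the position after the block that was read ahead, not the position after the line that was just returned.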
The documentation suggests it's done for performance reasons (and does not guarantee any particular size of the readahead buffer).
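To see the effect without reading the C source, here is a rough pure-Python model of a read-ahead line iterator. It is not how CPython implements it: only the 8192 constant comes from fileobject.c, while iter_lines_with_readahead and the generated sample data are hypothetical names used for illustration.

```python
import io

READAHEAD_BUFSIZE = 8192  # same value as the C macro; the rest is illustrative


def iter_lines_with_readahead(raw, bufsize=READAHEAD_BUFSIZE):
    """Yield (line, raw.tell()) pairs while refilling a buffer in big chunks."""
    buf = b""
    while True:
        newline = buf.find(b"\n")
        if newline == -1:
            # No complete line buffered: read ahead another chunk.
            chunk = raw.read(bufsize)
            if not chunk:
                if buf:
                    yield buf, raw.tell()  # final line without a newline
                return
            buf += chunk
            continue
        line, buf = buf[:newline + 1], buf[newline + 1:]
        # The underlying position is already past the whole chunk.
        yield line, raw.tell()


# Hypothetical sample data, large enough to need several chunks.
data = b"".join(("line %d\n" % i).encode("ascii") for i in range(2000))

lines, positions = [], []
for line, pos in iter_lines_with_readahead(io.BytesIO(data)):
    lines.append(line)
    positions.append(pos)
```

Every line served from the first chunk is paired with position 8192, which matches the output in the question. The position only changes when the buffer runs dry and the next chunk is read.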