Line reading chokes on 0x1A

I have the following file:

abcde
kwakwa
<0x1A>
line3
linllll

Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:

for line in open('t.txt'):
    print line,

It only reads the first two lines, and exits the loop.

The solution seems to be to open the file in binary mode ('rb') or universal newline mode ('rU'). Can you explain this behavior?

0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try using a command prompt and "type"ing your file. It will only display the content up to the Ctrl-Z.

Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics for files opened in text mode.
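
As a minimal sketch of the binary-mode workaround mentioned in the question: reading the raw bytes bypasses the CRT's text-mode handling, so the 0x1A byte comes through as ordinary data. The file name t.txt is taken from the question; the helper name read_all_lines and the temporary directory are made up for illustration.

```python
import os
import tempfile


def read_all_lines(path):
    """Return every line of the file as bytes, including lines after a 0x1A byte."""
    # 'rb' turns off text-mode translation, so Ctrl-Z is not treated as EOF.
    with open(path, 'rb') as f:
        return f.read().splitlines()


# Recreate the file from the question in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 't.txt')
with open(path, 'wb') as f:
    f.write(b'abcde\nkwakwa\n\x1a\nline3\nlinllll\n')

lines = read_all_lines(path)
```

The lines come back as byte strings with the line endings removed, and the lone 0x1A byte shows up as a line of its own instead of ending the read.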

Ned is of course correct.

If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end-of-file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level; it only tracked the number of floppy disk sectors. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.
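
To make the sector arithmetic concrete, here is a rough sketch of the idea, not actual CP/M code. The names pad_to_sector and strip_at_ctrl_z are hypothetical, and filling the whole remainder of the sector with Ctrl-Z is an assumption made for simplicity; only the first Ctrl-Z matters to a reader.

```python
SECTOR_SIZE = 128   # sector size mentioned above
CTRL_Z = b'\x1a'


def pad_to_sector(text):
    """Pad text with Ctrl-Z bytes up to the next multiple of SECTOR_SIZE."""
    remainder = len(text) % SECTOR_SIZE
    if remainder == 0:
        return text  # exact multiple: no marker is needed
    # Assumption: the rest of the sector is filled with Ctrl-Z bytes.
    return text + CTRL_Z * (SECTOR_SIZE - remainder)


def strip_at_ctrl_z(data):
    """Return the bytes before the first Ctrl-Z, as a text-mode reader would."""
    return data.split(CTRL_Z, 1)[0]


original = b'abcde\r\nkwakwa\r\n'
stored = pad_to_sector(original)      # what lands on disk: one full sector
recovered = strip_at_ctrl_z(stored)   # what a text-mode reader hands back
```

Here a 15-byte text occupies one full 128-byte sector, and the reader recovers the original 15 bytes by stopping at the first Ctrl-Z. The same rule is what cuts the question's file short: a Ctrl-Z in the middle of the data looks exactly like the end marker.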
