Do text files store their encoding method for later decoding?
- I was wondering if some text files store their encoding method along with their text content for later decoding?
- Or is it the text viewer's job to guess the encoding method for a given text file, and the guessing may not always be correct? If yes, how does a text viewer guess that?
Solution 1:
I was wondering if some text files store their encoding method along with their text content for later decoding?
Mark Szymanski's answer is correct: there is no explicit encoding information in a plain text file. That is the definition of "plain text file"; the "plain" refers to the fact that there is no metadata in the file.
However, some applications will place a byte-order mark (BOM) in text files encoded as UTF-16 or UTF-32/UCS-4. The BOM is not really meant to indicate the encoding (it indicates byte order, as the name says), but many applications will use the presence of the BOM to recognize UTF-16/UTF-32, so it serves as an encoding indicator.
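As a rough illustration of how an application might use a BOM as an encoding hint, here is a minimal Python sketch. The function name `sniff_bom` and the table `BOMS` are made up for this example. The UTF-8 BOM is included as well, since some applications also write one even though UTF-8 has no byte order to mark.

```python
import codecs

# Known BOMs. The UTF-32 marks come first because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    hint = sniff_bom(sample)
    if hint is not None:
        encoding, skip = hint
        # Skip the BOM bytes and decode the rest with the hinted encoding.
        print(encoding, sample[skip:].decode(encoding))
```

A file without a BOM simply yields `None` here, which says nothing about its encoding; the application then has to fall back to guessing or to a configured default.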
Or is it the text viewer's job to guess the encoding method for a given text file, and the guessing may not always be correct? If yes, how does a text viewer guess that?
Yes, the text viewer can only guess. It usually uses some heuristics:
- In some encodings (notably in UTF-8) not all byte sequences are valid. So an application can just try to decode the file as UTF-8. If it succeeds, the file is probably UTF-8; if it fails by finding an invalid byte sequence, it is not. This is how e.g. vim works by default: It will first try to use UTF-8 when reading a file; if that fails, it falls back to ISO-8859-1.
- In most older 8-bit encodings, any byte sequence is valid. In that case, you can sometimes guess encoding by looking at the byte histogram (frequency of different bytes/byte sequences). Internet Explorer used to do this to "guess" the encoding of a page. However, this is very error-prone, so few programs do this.
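The first heuristic (try UTF-8, fall back to ISO-8859-1) can be sketched in a few lines of Python. This is only an illustration of the idea, not how vim actually implements it, and `decode_guess` is a name invented for this example.

```python
def decode_guess(data: bytes):
    """Guess between UTF-8 and ISO-8859-1.

    Strict UTF-8 decoding rejects invalid byte sequences, so a successful
    decode is decent evidence for UTF-8. ISO-8859-1 maps every byte to a
    character, so the fallback can never fail (but it may be wrong).
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    for raw in ("café".encode("utf-8"), "café".encode("iso-8859-1")):
        text, encoding = decode_guess(raw)
        print(raw, "->", encoding, text)
```

The guess is still only a guess: the two bytes C3 A9 are valid ISO-8859-1 (the characters "Ã©") and also valid UTF-8 (the single character "é"), and this sketch will always pick UTF-8 for them.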
In most cases, a program must be explicitly told what the encoding of a text file is, otherwise it will not be able to read it correctly.
Solution 2:
Plain text files do not store any information about their encoding. A viewer decodes the file using the character encoding you have set for it. It cannot determine the encoding with certainty by itself, since to the computer the file is just a sequence of bytes.
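To see why the bytes alone do not settle the question, here is a small Python sketch that decodes the same three bytes under two different 8-bit encodings. The helper name `decode_with` and the choice of encodings are just for illustration.

```python
def decode_with(data: bytes, encodings):
    """Decode the same bytes under each encoding and return the results."""
    return {name: data.decode(name) for name in encodings}


if __name__ == "__main__":
    data = b"\xe9t\xe9"  # three bytes, no encoding label attached
    for name, text in decode_with(data, ["iso-8859-1", "cp437"]).items():
        # Both decodes succeed, but they produce different text.
        print(name, repr(text))
```

Neither decode raises an error, so nothing in the file itself tells the viewer which result is the intended one.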