Why does Vim show the Unicode code point and not the UTF-8 code value?

Consider this supposed line of code that I found on a PHP blog. Note the quotes:

throw new Exception(“That's not a server name!”);

The closing quote is RIGHT DOUBLE QUOTATION MARK (Unicode code point: U+201D; UTF-8 encoding: 0xE2 0x80 0x9D). With the cursor on it, pressing ga in Vim displays the following in the status bar:

<”> 8221, Hex 201d, Octal 20035

Why is the Unicode code point being displayed and not the UTF-8 code value?

Since the file is stored as UTF-8 and it is the terminal that translates the bytes into glyphs, I would expect Vim to show the raw bytes from the file (the UTF-8 code value) rather than translate them into a Unicode code point.


Solution 1:

Why is the Unicode code point being displayed and not the UTF-8 code value?

Because you use ga:

<”> 8221, Hex 201d, Octal 20035

instead of g8:

e2 80 9d
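
The two outputs can be reproduced outside Vim. Here is a minimal Python sketch; the helper names ga_like and g8_like are made up for illustration and only approximate the format of Vim's messages:

```python
# Approximate what Vim's ga and g8 report for a single character.
# ga_like and g8_like are hypothetical helper names, not Vim APIs.

ch = "\u201d"  # RIGHT DOUBLE QUOTATION MARK


def ga_like(c: str) -> str:
    """Code point in decimal, hex, and octal (what ga reports)."""
    cp = ord(c)
    return f"<{c}> {cp}, Hex {cp:04x}, Octal {cp:o}"


def g8_like(c: str) -> str:
    """UTF-8 bytes as space-separated hex (what g8 reports)."""
    return " ".join(f"{b:02x}" for b in c.encode("utf-8"))


print(ga_like(ch))  # <”> 8221, Hex 201d, Octal 20035
print(g8_like(ch))  # e2 80 9d
```

Both lines describe the same character: one is its number in Unicode, the other is how that number is serialized in UTF-8.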

Solution 2:

Because Vim is a text editor and works with text codepoints, not bytes. There is more than just one translation happening – when opening a file, the editor must decode it from the byte encoding to an internal representation (usually Unicode); when saving back to a file, or when displaying its contents on the terminal, the editor must encode the text back to bytes.

One reason for this is simple – the file and the terminal might be using different character sets. For example, you're editing some old documents in ISO 8859-13 or KOI8-R, and want them to show up correctly on a UTF-8 terminal.
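
The decode-then-encode round trip can be sketched in Python. This is only an illustration of the idea, not how Vim is implemented; the sample string and the choice of ISO 8859-13 as the file encoding are assumptions:

```python
# Sketch: a file stored in ISO 8859-13 shown on a UTF-8 terminal.
# The editor decodes file bytes to text, then encodes text for the terminal.

# Pretend these bytes were read from an old ISO 8859-13 document.
file_bytes = "Ji pasakė".encode("iso8859_13")

# Step 1: decode the file's bytes into text (a sequence of code points).
text = file_bytes.decode("iso8859_13")

# Step 2: encode the text in the terminal's character set for display.
terminal_bytes = text.encode("utf-8")

print(len(text))            # 9 characters
print(len(file_bytes))      # 9 bytes on disk
print(len(terminal_bytes))  # 10 bytes sent to the terminal (ė takes 2)
```

The text in the middle is the same no matter which byte encoding sits on either side of it, which is why the editor works on that representation.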

The second reason, again, is that text editors work with text. For example, ” is one character and its width is one terminal cell, regardless of its byte encoding (3 bytes in UTF-8, 1 byte in Windows-1257, 2 bytes in Shift-JIS, and so on). If Vim counted it as three bytes while the terminal showed it as one cell, vertical splits would be misaligned, lines would wrap too soon, tabs would appear too narrow, and so on.

Instead of this...                ...you would see this.

┌───────────────────────────┐     ┌───────────────────────────┐
│She said, "Hello."         │     │She said, "Hello."         │
│                           │     │                           │
│She said, “Hello.”         │     │She said, “Hello.”     │
│                           │     │                           │
│Ji pasakė, „Sveiki“.       │     │Ji pasakė, „Sveiki“. │
└───────────────────────────┘     └───────────────────────────┘
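
The misalignment in the right-hand box can be reproduced with a short Python sketch. It assumes every character here occupies exactly one terminal cell, so len() of the string stands in for display width; the function names and the box width of 27 are chosen for illustration:

```python
# Sketch: padding lines to a fixed box width by characters vs. by UTF-8 bytes.
# Assumes each character in these samples is one terminal cell wide.

WIDTH = 27  # inner width of the box, in terminal cells

lines = ['She said, "Hello."', "She said, “Hello.”", "Ji pasakė, „Sveiki“."]


def pad_by_chars(line: str) -> str:
    """Pad using the character count: the right edge lines up."""
    return line + " " * (WIDTH - len(line))


def pad_by_bytes(line: str) -> str:
    """Pad using the UTF-8 byte count: too little padding for non-ASCII."""
    return line + " " * (WIDTH - len(line.encode("utf-8")))


for line in lines:
    print("│" + pad_by_chars(line) + "│     │" + pad_by_bytes(line) + "│")
```

Only the pure-ASCII line comes out the same in both columns; the others fall short by the number of extra bytes their non-ASCII characters need.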

Not to mention, you'd have to Backspace three times to delete a single character.
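
A rough Python sketch of that last point: the number of "byte backspaces" depends on the encoding, and deleting only one byte leaves invalid data behind. The sample string is made up, and the codec names are Python's, not Vim's:

```python
# Sketch: deleting a character vs. deleting a byte.

ch = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# How many bytes one character occupies in a few encodings.
byte_counts = {enc: len(ch.encode(enc)) for enc in ("utf-8", "cp1257", "shift_jis")}
print(byte_counts)

text = "Hello." + ch

# Deleting one character from text: the whole quote disappears.
after_char_delete = text[:-1]
print(after_char_delete)  # Hello.

# Deleting one byte from the UTF-8 form: an incomplete sequence remains.
truncated = text.encode("utf-8")[:-1]
try:
    truncated.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)  # False
```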