Unicode, Unicode Big Endian or UTF-8? What is the difference? Which format is better?
Dunno. Which is better: a saw or a hammer? :-)
Unicode isn't UTF: Unicode is the character set, and the UTFs are ways of encoding it into bytes.
There's a part of the article that's more relevant to the subject at hand, though:
- UTF-8 focuses on minimizing the byte size for representing characters from the ASCII set (variable-length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:
> “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings.
- UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping each Unicode code point directly to 4 bytes. Obviously, it’s not very size-efficient.
- UTF-16 is a compromise: it uses 2 bytes most of the time, but expands to 2 × 2 bytes per character for characters outside the Basic Multilingual Plane (BMP). The byte-count sketch after this list shows the trade-offs side by side.
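A quick way to see these trade-offs is to encode a few code points and compare byte counts. A minimal Python sketch (the sample characters are arbitrary choices; the explicit little-endian codecs are used so a byte-order mark doesn't inflate the counts):

```python
# Compare the encoded size of a few code points in UTF-8, UTF-16 and UTF-32.
# The "-le" codec variants are used so Python adds no byte-order mark.
samples = ["A",   # U+0041, ASCII
           "é",   # U+00E9, Latin-1 range
           "€",   # U+20AC, inside the BMP
           "𝄞"]   # U+1D11E, outside the BMP (musical G clef)
for ch in samples:
    print(f"U+{ord(ch):04X}: "
          f"UTF-8={len(ch.encode('utf-8'))}  "
          f"UTF-16={len(ch.encode('utf-16-le'))}  "
          f"UTF-32={len(ch.encode('utf-32-le'))}")
```

This prints 1/2/4, 2/2/4, 3/2/4 and 4/4/4 bytes respectively, matching the three descriptions above.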
Also see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
For European languages, UTF-8 is smaller. For East Asian languages, the difference is not so clear-cut: most CJK characters sit in the BMP, so they take 3 bytes in UTF-8 but only 2 in UTF-16.
Both will handle all possible Unicode characters, so it should make no difference in compatibility.
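To put rough numbers on the size claim, here is a small Python sketch (the sample strings are arbitrary; the exact ratio depends on the text):

```python
# Rough size comparison: Latin-script text vs. CJK text.
english = "Hello, world"    # ASCII: 1 byte/char in UTF-8, 2 in UTF-16
japanese = "こんにちは世界"   # BMP CJK: 3 bytes/char in UTF-8, 2 in UTF-16
for label, text in [("English", english), ("Japanese", japanese)]:
    print(label,
          "UTF-8:", len(text.encode("utf-8")),
          "UTF-16:", len(text.encode("utf-16-le")))
```

The English sample comes out smaller in UTF-8 (12 vs. 24 bytes), the Japanese sample smaller in UTF-16 (14 vs. 21 bytes).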
There are more Unicode character encodings than you may think.
- UTF-8
The UTF-8 encoding is variable-width, ranging from 1-4 bytes, with the upper bits of each byte reserved as control bits. The leading bits of the first byte indicate the total number of bytes used for that character. The scalar value of a character's code point is the concatenation of the non-control bits. In this table,
`x` represents the lowest 8 bits of the Unicode value, `y` represents the next higher 8 bits, and `z` represents the bits higher than that.

| Unicode          | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
| ---------------- | -------- | -------- | -------- | -------- |
| U+0000–U+007F    | 0xxxxxxx |          |          |          |
| U+0080–U+07FF    | 110yyyxx | 10xxxxxx |          |          |
| U+0800–U+FFFF    | 1110yyyy | 10yyyyxx | 10xxxxxx |          |
| U+10000–U+10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
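To make the table concrete, here is a hand-rolled encoder in Python that follows the bit layout above. It's an illustrative sketch only: it omits the surrogate-range check a real encoder needs, and production code should use a library codec instead.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point per the bit layout in the table above.

    Sketch only: does not reject surrogates (U+D800-U+DFFF).
    """
    if code_point <= 0x7F:          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:      # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

assert utf8_encode(0x20AC) == "€".encode("utf-8")  # b'\xe2\x82\xac'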
- UCS-2
- UCS-2BE
- UCS-2LE
- UTF-16
- UTF-16BE
- UTF-16LE
- UTF-32
- UTF-32BE
- UTF-32LE
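The BE/LE suffixes pin the byte order explicitly; the plain variants usually signal theirs with a byte-order mark (BOM) instead. A small Python illustration (output shown for a little-endian machine):

```python
# The same character in three UTF-16 flavours: byte order is the only difference.
ch = "€"  # U+20AC
print(ch.encode("utf-16-be").hex())  # '20ac'     - big-endian, no BOM
print(ch.encode("utf-16-le").hex())  # 'ac20'     - little-endian, no BOM
print(ch.encode("utf-16").hex())     # 'fffeac20' - BOM (FF FE) + native order
```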