Java Unicode encoding
Solution 1:
You can handle them all if you're careful enough.
Java's char is a UTF-16 code unit. Characters with a code point above 0xFFFF are encoded as two chars (a surrogate pair).
See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)
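For example, here is a minimal sketch (the code point U+1F600 is just an arbitrary supplementary character chosen for the demo) showing the round trip between a supplementary code point and its surrogate pair:

    public class SupplementaryDemo {
        public static void main(String[] args) {
            int codePoint = 0x1F600;                            // a code point above 0xFFFF
            char[] units = Character.toChars(codePoint);        // the surrogate pair: 2 chars
            System.out.println(units.length);                   // 2
            String s = new String(units);
            System.out.println(s.length());                     // 2 -> counts chars, not characters
            System.out.println(s.codePointAt(0) == codePoint);  // true -> pair decoded back
        }
    }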
Solution 2:
Java uses UTF-16. A single Java char can only represent characters from the Basic Multilingual Plane; other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
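To make that concrete, here is a small sketch (the string content is an arbitrary example) contrasting a naive char loop with a code-point-aware loop:

    public class CodePointIteration {
        public static void main(String[] args) {
            String s = "a\uD840\uDC00b";  // 'a', U+20000 (a surrogate pair), 'b'

            // Naive loop: 4 iterations, each surrogate half shows up as a bogus "character".
            for (int i = 0; i < s.length(); i++) {
                System.out.printf("char[%d] = U+%04X%n", i, (int) s.charAt(i));
            }

            // Code-point-aware loop: 3 iterations, one per real Unicode character.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("code point U+%04X%n", cp);
                i += Character.charCount(cp);  // advance by 1 or 2 chars
            }
        }
    }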
Solution 3:
To add to the other answers, some points to remember:
- A Java char always takes 16 bits. A Unicode character, when encoded as UTF-16, "almost always" (but not always) takes 16 bits: that's because there are more than 64K Unicode characters. Hence, a Java char is NOT a Unicode character (though it "almost always" is).
- "Almost always", above, means the first 64K code points of Unicode, range 0x0000 to 0xFFFF (the Basic Multilingual Plane, or BMP), which take 16 bits in the UTF-16 encoding.
- A non-BMP ("rare") Unicode character is represented as two Java chars (a surrogate pair). This also applies to the literal representation inside a string: for example, the character U+20000 is written as "\uD840\uDC00" (see the sketch just after this list).
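A sketch of that last point, just demonstrating the literal from above:

    public class SurrogateLiteral {
        public static void main(String[] args) {
            String rare = "\uD840\uDC00";                         // the single character U+20000
            System.out.println(rare.codePointAt(0) == 0x20000);   // true
            System.out.printf("U+%04X%n", (int) rare.charAt(0));  // U+D840, the high surrogate
            System.out.printf("U+%04X%n", (int) rare.charAt(1));  // U+DC00, the low surrogate
        }
    }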
Corollary: string.length() returns the number of Java chars, not of Unicode characters. A string containing just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences.
Java has little built-in intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int ch). Those are the real fully-Unicode methods.
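For instance, a short sketch (using U+20000 again as the example character) of the char-based view versus the code-point-based methods:

    public class CodePointMethods {
        public static void main(String[] args) {
            String rare = "\uD840\uDC00";  // one Unicode character, U+20000

            System.out.println(rare.length());                          // 2 (char units)
            System.out.println(rare.codePointCount(0, rare.length()));  // 1 (Unicode characters)

            int cp = rare.codePointAt(0);
            System.out.println(Character.isLetter(cp));               // true; the int overload handles non-BMP
            System.out.println(Character.isLetter(rare.charAt(0)));   // false; this tests a lone surrogate

            // Java 8+: stream over code points instead of chars.
            rare.codePoints().forEach(c -> System.out.printf("U+%X%n", c));
        }
    }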