Solution 1:

You can handle them all if you're careful enough.

Java's char is a UTF-16 code unit. A character with a code point above 0xFFFF is encoded as two chars (a surrogate pair).

See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
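
As a minimal sketch of what that handling looks like (using only the standard java.lang API): walk the string by code point rather than by char, so a surrogate pair is processed as one character. The class name is just for illustration:

    public class CodePointWalk {
        public static void main(String[] args) {
            // U+20000 needs a surrogate pair in UTF-16, so it occupies two chars.
            String s = "a\uD840\uDC00b";

            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);     // full code point, possibly > 0xFFFF
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);  // advance by 1 or 2 chars
            }
        }
    }

This prints U+0061, U+20000, U+0062: three characters, even though the string contains four chars.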

(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 possible code points.)

Solution 2:

Java uses UTF-16. A single Java char can represent only characters from the Basic Multilingual Plane; other characters have to be represented by a surrogate pair of two chars. This is reflected in API methods such as String.codePointAt().

And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
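
A minimal sketch of that breakage: code that indexes chars directly sees two lone surrogates where there is really only one character, while the code-point API sees the character itself (class name is hypothetical):

    public class SurrogateBreakage {
        public static void main(String[] args) {
            String s = "\uD840\uDC00";  // one character, U+20000, stored as two chars

            // Naive per-char processing sees two meaningless halves.
            System.out.printf("U+%04X%n", (int) s.charAt(0));  // U+D840 (high surrogate)
            System.out.printf("U+%04X%n", (int) s.charAt(1));  // U+DC00 (low surrogate)

            // Code-point-aware access sees the real character.
            System.out.printf("U+%04X%n", s.codePointAt(0));   // U+20000
        }
    }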

Solution 3:

To add to the other answers, some points to remember:

  • A Java char always takes 16 bits.

  • A Unicode character, when encoded as UTF-16, "almost always" (but not always) takes 16 bits: that's because there are more than 64K Unicode characters. Hence, a Java char is NOT a Unicode character (though it "almost always" is).

  • "Almost always", above, means the 64K first code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.

  • A non-BMP ("rare") Unicode character is represented as two Java chars (a surrogate pair). This also applies to string literals: for example, the character U+20000 is written as "\uD840\uDC00".

  • Corollary: string.length() returns the number of Java chars, not the number of Unicode characters. A string containing just one "rare" Unicode character (e.g. U+20000) returns length() = 2. The same applies to any method that deals with char sequences.

  • Java has little built-in support for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points represented as ints, e.g. Character.isLetter(int codePoint); those are the truly Unicode-aware methods. See the sketch after this list.
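
A small sketch contrasting the char view and the code-point view of the same string (only standard java.lang methods are used; the class name is just for illustration):

    public class CodePointView {
        public static void main(String[] args) {
            String s = "\uD840\uDC00";  // a single "rare" character, U+20000

            System.out.println(s.length());                      // 2: Java chars
            System.out.println(s.codePointCount(0, s.length())); // 1: Unicode characters

            // The char overload sees only half a surrogate pair...
            System.out.println(Character.isLetter(s.charAt(0)));      // false
            // ...while the int overload sees the whole code point.
            System.out.println(Character.isLetter(s.codePointAt(0))); // true
        }
    }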