What is the character encoding of String in Java?
Solution 1:
Java stores strings as UTF-16 internally.
"Default encoding" isn't quite right, though: that term refers to the encoding Java uses externally, the "system default encoding", which varies from platform to platform and can even be altered by things like environment variables on some platforms.
ASCII is a subset of Latin-1, which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your
int i = 'x';
test for any character that falls in the ASCII range, you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.
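A minimal sketch of that test (the printed values assume standard Unicode code points; the class name is illustrative):

public class CharValueDemo {          // illustrative example class
    public static void main(String[] args) {
        int i = 'x';                  // ASCII character: you get its ASCII value
        System.out.println(i);        // 120

        int euro = '\u20AC';          // the euro sign is outside ASCII but still one UTF-16 code unit
        System.out.println(euro);     // 8364
    }
}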
From the java.lang.Character docs:
The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.
So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.
Solution 2:
1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.
2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99 (see the sketch after this list).
3) Since each String is an object, it carries other information besides its class members (e.g., a class descriptor word, a lock/semaphore word, etc.).
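A small sketch of point 2, showing the same conversion in decimal and hex (the class name is illustrative):

public class CharToIntDemo {                         // illustrative example class
    public static void main(String[] args) {
        int i = 'c';                                 // 'c' is U+0063
        System.out.println(i);                       // 99
        System.out.println(Integer.toHexString(i));  // 63
        System.out.println((char) 99);               // c -- the conversion works both ways
    }
}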
ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).
EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).
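A back-of-the-envelope sketch of that arithmetic, assuming 32-bit (4-byte) words and the layout described above; real JVMs differ (64-bit headers, alignment padding, and modern JDKs with compact strings store the contents in a byte[] instead). The class name is illustrative:

public class StringSizeEstimate {        // illustrative example class
    public static void main(String[] args) {
        String s = "hello";
        int word = 4;                    // assumed 32-bit word size
        int stringObject = 2 * word      // class descriptor/pointer + lock word
                         + 4             // int length field
                         + word;         // reference to the char[] array
        int charArray = 2 * word         // array object header
                      + 4                // array length word
                      + s.length() * 2;  // two bytes per 16-bit char element
        System.out.println("~" + (stringObject + charArray) + " bytes"); // ~38, before any padding
    }
}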
ADDENDUM 2
The assumption that one char represents one Unicode character only holds most of the time. That would imply UCS-2 encoding, and it was essentially true before Java 5 added supplementary-character support (around 2004/2005). By now Unicode has grown beyond the Basic Multilingual Plane and Strings have to be encoded using UTF-16 -- where, alas, a single Unicode character may use two chars in a Java String.
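A minimal illustration, using U+1F600 (a character outside the Basic Multilingual Plane, written here as its surrogate pair; the class name is illustrative):

public class SurrogatePairDemo {                               // illustrative example class
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";                            // 'a' plus U+1F600 encoded as two chars
        System.out.println(s.length());                        // 3 -- counts UTF-16 code units (chars)
        System.out.println(s.codePointCount(0, s.length()));   // 2 -- counts actual Unicode characters
        System.out.println(Character.charCount(0x1F600));      // 2 -- this code point needs two chars
    }
}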
Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html
Solution 3:
While this doesn't answer your question, it is worth noting that in the Java byte code (the class file), strings are stored in UTF-8 (strictly speaking, the JVM's modified UTF-8). http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
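If you want to see the size difference this implies, DataOutputStream.writeUTF uses the same modified UTF-8 format as class-file string constants -- a rough sketch (the class name is illustrative):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {                       // illustrative example class
    public static void main(String[] args) throws IOException {
        String s = "h\u00E9llo";                      // "héllo": 5 chars = 10 bytes internally (UTF-16)
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new DataOutputStream(out).writeUTF(s);        // modified UTF-8, as in the class-file constant pool
        System.out.println(out.size());               // 8 -- a 2-byte length prefix + 6 bytes of UTF-8
    }
}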
Solution 4:
Edit: thanks to LoadMaster for helping me correct my answer :)
1) All internal String processing is done in UTF-16.
2) ASCII is a subset of UTF-16 (every ASCII character keeps the same numeric value).
3) Internally, Java uses UTF-16. For everything else -- files, streams, the console -- the encoding depends on where you are, yes: it is the platform default unless you specify one explicitly (see the sketch below).
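A minimal sketch of that boundary: the same String produces different bytes depending on which external encoding you pick, so it's safer to name the encoding explicitly rather than rely on the platform default (the class name is illustrative):

import java.nio.charset.StandardCharsets;

public class EncodingBoundaryDemo {                               // illustrative example class
    public static void main(String[] args) {
        String s = "na\u00EFve";                                  // "naïve": internally always UTF-16
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " vs " + latin1.length); // 6 vs 5 -- the 'ï' needs two UTF-8 bytes
    }
}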