Isn't the size of character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.

public static void readFile(RandomAccessFile fr) {
    byte[] cbuff = new byte[1];
    fr.read(cbuff,0,1);
    System.out.println(new String(cbuff));
}

Why am I seeing one full character being read by this?


Solution 1:

A char represents a character in Java (*). It is 2 bytes large (or 16 bits).

That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).

When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset. Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.

If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).

That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.

(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.

Solution 2:

Java stores all it's "chars" internally as two bytes. However, when they become strings etc, the number of bytes will depend on your encoding.

Some characters (ASCII) are single byte, but many others are multi-byte.

Java supports Unicode, thus according to:

Java Character Docs

The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).

Solution 3:

The constructor String(byte[] bytes) takes the bytes from the buffer and encodes them to characters.

It uses the platform default charset to encode bytes to characters. If you know, your file contains text, that is encoded in a different charset, you can use the String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).

Solution 4:

In ASCII text file each character is just one byte