Converting UTF-8 to ISO-8859-1 in Java - how to keep it as single byte
I am trying to convert a string encoded in UTF-8 to ISO-8859-1 in Java. For example, in the string 'âabcd', 'â' is represented in ISO-8859-1 as the single byte E2. In UTF-8 it is represented as two bytes, C3 A2 I believe. When I do a getBytes(encoding) and then create a new string from the bytes using ISO-8859-1, I get two different characters: Ã¢. Is there any other way to do this so as to keep the character the same, i.e. âabcd?
Solution 1:
If you're dealing with character encodings other than UTF-16, you shouldn't be using java.lang.String or the char primitive; you should only be using byte[] arrays or ByteBuffer objects. Then, you can use java.nio.charset.Charset to convert between encodings:
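import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte)0xC3, (byte)0xA2});
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
// copy only the encoded bytes; the backing array may be larger than the content
byte[] outputData = new byte[outputBuffer.remaining()];
outputBuffer.get(outputData);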
Solution 2:
byte[] iso88591Data = theString.getBytes("ISO-8859-1");
Will do the trick. From your description it seems as if you're trying to "store an ISO-8859-1 String". String objects in Java are always implicitly encoded in UTF-16. There's no way to change that encoding.
What you can do, though, is get the bytes that constitute some other encoding of it (using the .getBytes() method as shown above).
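To make that concrete, here is a minimal sketch showing the same String producing different byte sequences depending on the charset you ask for. The class and method names (EncodingDemo, toLatin1, toUtf8) are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: the ISO-8859-1 bytes of the text.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: the UTF-8 bytes of the same text.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00e2abcd"; // "âabcd", written with an escape to avoid source-encoding issues
        System.out.println(toLatin1(text).length); // one byte per character here
        System.out.println(toUtf8(text).length);   // 'â' takes two bytes in UTF-8
    }
}
```

One thing to keep in mind: String.getBytes silently substitutes the charset's replacement byte ('?' for ISO-8859-1) for characters that have no ISO-8859-1 representation, so this only round-trips text that fits in Latin-1.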
Solution 3:
Starting with a set of bytes that encode a string using UTF-8, this creates a string from that data, then gets the bytes that encode the string in a different encoding:
byte[] utf8bytes = { (byte)0xc3, (byte)0xa2, 0x61, 0x62, 0x63, 0x64 };
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
String string = new String ( utf8bytes, utf8charset );
System.out.println(string);
// "When I do a getbytes(encoding) and "
byte[] iso88591bytes = string.getBytes(iso88591charset);
for ( byte b : iso88591bytes )
System.out.printf("%02x ", b);
System.out.println();
// "then create a new string with the bytes in ISO-8859-1 encoding"
String string2 = new String ( iso88591bytes, iso88591charset );
// "I get a two different chars"
System.out.println(string2);
This outputs the strings and the ISO-8859-1 bytes correctly:
âabcd
e2 61 62 63 64
âabcd
So your byte array wasn't paired with the correct encoding:
String failString = new String ( utf8bytes, iso88591charset );
System.out.println(failString);
Outputs
Ã¢abcd
(Either that, or you just wrote the UTF-8 bytes to a file and read them elsewhere as ISO-8859-1.)
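If the mix-up has already happened and all you have is the wrongly decoded string (Ã¢abcd rather than âabcd), it can usually be undone by reversing the two steps. A rough sketch, assuming the bad decode really was ISO-8859-1 (not windows-1252) and nothing was lost along the way; MojibakeRepair and repair are names invented for this example:

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Hypothetical helper: undoes "UTF-8 bytes were decoded as ISO-8859-1".
    static String repair(String misdecoded) {
        // ISO-8859-1 maps every char U+0000..U+00FF to the byte with the same
        // value, so this recovers the original UTF-8 bytes.
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset they were actually written in.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = "\u00c3\u00a2abcd"; // "Ã¢abcd"
        System.out.println(repair(broken).equals("\u00e2abcd")); // prints true
    }
}
```

This is a workaround rather than a fix; the better option is still to pair each byte array with the right charset at the point where it is first decoded.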
Solution 4:
This is what I needed:
public static byte[] encode(byte[] arr, String fromCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName("UTF-8"));
}
public static byte[] encode(byte[] arr, String fromCharsetName, String targetCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName(targetCharsetName));
}
public static byte[] encode(byte[] arr, Charset sourceCharset, Charset targetCharset) {
    ByteBuffer inputBuffer = ByteBuffer.wrap( arr );
    CharBuffer data = sourceCharset.decode(inputBuffer);
    ByteBuffer outputBuffer = targetCharset.encode(data);
    // copy only the encoded bytes; array() may expose unused trailing capacity
    byte[] outputData = new byte[outputBuffer.remaining()];
    outputBuffer.get(outputData);
    return outputData;
}
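Note that Charset.decode and Charset.encode quietly replace anything malformed or unmappable. If you would rather get an error when the input contains characters that ISO-8859-1 cannot represent, something along these lines might work. It is only a sketch: StrictTranscoder and transcode are hypothetical names, and it copies just the remaining bytes of the result buffer instead of relying on its backing array.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictTranscoder {
    // Hypothetical helper: converts bytes from one charset to another and
    // throws instead of substituting replacement characters.
    static byte[] transcode(byte[] input, Charset source, Charset target)
            throws CharacterCodingException {
        CharBuffer chars = source.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input));
        ByteBuffer encoded = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        // Copy only the bytes that were actually written.
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }
}
```

Called with the UTF-8 bytes C3 A2 and a target of ISO-8859-1, this should return the single byte E2; called with the UTF-8 bytes of a character outside Latin-1 (the euro sign, say), it throws a CharacterCodingException rather than emitting '?'.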