Reading UTF-8 - BOM marker

In Java, you have to consume manually the UTF8 BOM if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it will break existing tools like JavaDoc or XML parsers. The Apache IO Commons provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

Also see this Guava bug report

Use the Apache Commons library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}

Reading UTF-8 - BOM marker

Related

Recent Posts