What is XML BOM and how do I detect it?

What exactly is the BOM in an ANSI XML document, and should it be removed? Should an XML document be in UTF-8 instead? Can anyone tell me a Java method that will detect the BOM? The BOM consists of the bytes EF BB BF.


Solution 1:

For an ANSI XML file the BOM should be removed. If you want to use UTF-8, you don't really need it either. It only matters for UTF-16 and UTF-32.

The Byte-Order-Mark (or BOM) is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. The BOM is mandatory for UTF-16 and UTF-32, but it is optional for UTF-8.

(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)

Regarding the question of how to detect this in Java:

Check the following answer to this question: Java : How to determine the correct charset encoding of a stream

Basically just read in the first few bytes yourself and then determine if you may have found a BOM.
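
As a rough illustration of "read the first few bytes yourself", here is a minimal sketch. The class name BomSniffer and the method detectAndSkip are made up for this example and are not part of any library. It assumes the caller wraps the raw stream in a PushbackInputStream with a pushback buffer of at least 4 bytes, so that any bytes that turn out not to belong to a BOM can be put back.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical helper, not a library API.
class BomSniffer {
    // Longer patterns come first because the UTF-32LE BOM (ff fe 00 00)
    // starts with the UTF-16LE BOM (ff fe).
    private static final String[] NAMES =
        {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    /** Returns the index of the BOM found at the start of head, or -1 if none. */
    static int match(byte[] head, int len) {
        for (int i = 0; i < BOMS.length; i++) {
            int[] bom = BOMS[i];
            if (len < bom.length) continue;
            boolean same = true;
            for (int j = 0; j < bom.length; j++) {
                // Mask to 0..255 because Java bytes are signed.
                if ((head[j] & 0xFF) != bom[j]) { same = false; break; }
            }
            if (same) return i;
        }
        return -1;
    }

    /**
     * Returns the encoding name implied by a leading BOM, or null if there is none.
     * Afterwards the stream is positioned just past the BOM (or at the start).
     * Assumes the stream was created with a pushback buffer of at least 4 bytes.
     */
    static String detectAndSkip(PushbackInputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 4 bytes were available
            n += r;
        }
        int i = match(head, n);
        int bomLen = (i < 0) ? 0 : BOMS[i].length;
        if (n > bomLen) {
            in.unread(head, bomLen, n - bomLen); // give back the non-BOM bytes
        }
        return (i < 0) ? null : NAMES[i];
    }
}
```

A call would look like `BomSniffer.detectAndSkip(new PushbackInputStream(rawStream, 4))`, where rawStream is whatever InputStream you already have. A null result only means "no BOM", not "ANSI"; the actual encoding still has to come from the XML declaration or from outside knowledge.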

Solution 2:

The byte order mark is likely to be one of these byte sequences:

     UTF-8 BOM: ef bb bf 
  UTF-16BE BOM: fe ff 
  UTF-16LE BOM: ff fe 
  UTF-32BE BOM: 00 00 fe ff 
  UTF-32LE BOM: ff fe 00 00 

These are the variously encoded forms of the Unicode codepoint U+FEFF. This can be expressed as a Java char literal using '\uFEFF' (Java char values are implicitly UTF-16). Most non-Unicode encodings have no representation for U+FEFF, so the BOM codepoint cannot be encoded in them. (More on encoding the BOM using Java here.)
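
To see where the byte sequences come from, you can encode U+FEFF yourself. The sketch below uses only java.nio.charset from the standard library; the class name BomBytes and its helper methods are invented for this example. UTF-32 is left out because StandardCharsets has no UTF-32 constants on older JDKs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not a library API.
class BomBytes {
    /** Formats bytes as lowercase hex separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the single codepoint U+FEFF in the given charset. */
    static String bomFor(Charset cs) {
        return hex("\uFEFF".getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println(bomFor(StandardCharsets.UTF_8));    // ef bb bf
        System.out.println(bomFor(StandardCharsets.UTF_16BE)); // fe ff
        System.out.println(bomFor(StandardCharsets.UTF_16LE)); // ff fe
        // A single-byte legacy charset cannot represent U+FEFF at all.
        System.out.println(
            StandardCharsets.ISO_8859_1.newEncoder().canEncode('\uFEFF')); // false
    }
}
```

The explicit UTF_16BE and UTF_16LE charsets are used on purpose: the plain UTF_16 charset writes its own BOM when encoding, which would double up the bytes in this demo.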

When it comes to BOMs and XML, a BOM is optional for UTF-8 (see also the Unicode BOM FAQ); the XML specification does require one for documents encoded in UTF-16. Detection of encoding in XML is relatively straightforward if the encoding is specified in the declaration. Always make sure that the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) matches the encoding used to write the document. If you are strict about this, parsers should be able to interpret your documents correctly. (XML spec on encoding detection.)
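
One way to keep the declaration and the actual encoding from drifting apart is to derive both from the same Charset object. This is only a sketch under that assumption; XmlWriterSketch and its write method are names invented here, and the body is assumed to be already well-formed XML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not a library API.
class XmlWriterSketch {
    /**
     * Serializes an XML body with a declaration naming the same charset
     * that is used to encode the bytes.
     */
    static byte[] write(String body, Charset cs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            // cs.name() is the canonical charset name, e.g. "UTF-8".
            w.write("<?xml version=\"1.0\" encoding=\"" + cs.name() + "\"?>\n");
            w.write(body);
        }
        return out.toByteArray();
    }
}
```

For example, `XmlWriterSketch.write("<a/>", StandardCharsets.UTF_8)` yields bytes that start directly with `<?xml` and carry no BOM. Be aware that characters the chosen charset cannot represent are silently replaced by the writer, so a legacy charset still needs care.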

I advocate encoding as Unicode wherever possible (see also the 10 Commandments of Unicode). That said, XML allows the representation of any Unicode character via numeric character references (e.g. 'A' could be represented by &#x0041;), so a Unicode encoding isn't strictly required to avoid data loss.
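
If you do have to write a non-Unicode document, the character references mentioned above can be generated mechanically. The following is a minimal sketch; CharRefs and escapeNonAscii are hypothetical names, and the input is assumed to be character data or an attribute value whose markup characters (<, &, and so on) have already been escaped.

```java
// Hypothetical helper, not a library API.
class CharRefs {
    /** Replaces every codepoint outside ASCII with a hexadecimal character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        // Iterate by codepoint so surrogate pairs become a single reference.
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase())
                  .append(';');
            }
        });
        return sb.toString();
    }
}
```

With this, `CharRefs.escapeNonAscii("caf\u00e9")` gives `caf&#xE9;`. Character references work in element content and attribute values only, not in element or attribute names.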