How default is the default encoding (UTF-8) in the XML Declaration?

Solution 1:

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations, i.e. between declaring encoding="UTF-8" explicitly and omitting the encoding declaration entirely.

The long answer is far more interesting though.

What The Spec Says

If you look at Appendix F.1 of the XML specification, it explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.
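As a rough illustration (my own sketch, not from the spec or the original answer), that detection step might look something like this for the UTF-8/UTF-16 cases discussed here; Appendix F lists several more byte patterns (UTF-32, EBCDIC and so on):

# Minimal sketch of the Appendix F detection step, covering only the
# UTF-8/UTF-16 cases discussed here; the real table has more patterns.
def sniff_encoding(first_four: bytes) -> str:
    if first_four.startswith(b"\xef\xbb\xbf"):
        return "utf-8"        # UTF-8 BOM
    if first_four.startswith(b"\xff\xfe"):
        return "utf-16-le"    # UTF-16 little-endian BOM
    if first_four.startswith(b"\xfe\xff"):
        return "utf-16-be"    # UTF-16 big-endian BOM
    if first_four == b"<?xm":
        return "utf-8"        # no BOM: ASCII-compatible, so read the declaration
    if first_four == b"<\x00?\x00":
        return "utf-16-le"    # '<?' encoded as UTF-16 little-endian, no BOM
    if first_four == b"\x00<\x00?":
        return "utf-16-be"    # '<?' encoded as UTF-16 big-endian, no BOM
    return "utf-8"            # default when nothing else matches

print(sniff_encoding(b"\xff\xfe<\x00"))  # -> utf-16-le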

However, according to the spec, it should still read the encoding declaration.

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.

If they don't match, according to section 4.3.3:

...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

Encoded UTF-16, Declared UTF-8

Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.
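If you want to reproduce the experiment, such a mismatched file can be generated with something along these lines (a sketch; the file name and content are arbitrary):

# Sketch: write a UTF-16 encoded file whose declaration claims UTF-8.
# Python's "utf-16" codec prepends a byte order mark automatically.
xml = '<?xml version="1.0" encoding="utf-8"?>\n<root>mismatched</root>\n'
with open("mismatch.xml", "w", encoding="utf-16") as f:
    f.write(xml)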

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least) displays a blank document, but no actual error.

So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it'll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.

Technically, I think IE is the most accurate. That it doesn't display an error as such might be explained by the error occurring at the encoding level rather than the XML level: presumably it does its best to interpret the UTF-16 bytes as UTF-8, fails to find any characters that decode, and ends up passing an empty character sequence to the XML parser.

Encoded UTF-8, Declared Otherwise

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.

If you encode a document as UTF-8 (with a byte order marker so it's unmistakable as anything else), but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again this seems right to me. The fact that the BOM characters aren't valid in Latin1 just means they are silently dropped at the character decoding level.
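For reference, the test file for this second experiment could be produced like so (again a sketch; "utf-8-sig" is Python's name for UTF-8 with a leading BOM):

# Sketch: content encoded as UTF-8 with a BOM, but declared as Latin1 (ISO-8859-1).
xml = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<root>caf\u00e9</root>\n'
with open("latin1-declared.xml", "w", encoding="utf-8-sig") as f:
    f.write(xml)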

This doesn't work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we're back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.

Other Inconsistencies

It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

Entities encoded in UTF-16 MUST [...] begin with the Byte Order Mark

However, if you try to read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.
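For completeness, such a BOM-less UTF-16 file can be produced like this (a sketch; Python's "utf-16-le" codec, unlike "utf-16", writes no BOM):

# Sketch: UTF-16 little-endian without a BOM, which section 4.3.3
# forbids for UTF-16 entities.
xml = '<?xml version="1.0" encoding="utf-16"?>\n<root/>\n'
with open("no-bom.xml", "w", encoding="utf-16-le") as f:
    f.write(xml)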

External Encoding Information

Up to now, we've been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received over HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take precedence over the document's own encoding.

Most of the details for the various XML MIME types are described in RFC 3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.

Solution 2:

In isolation, both are equivalent; you have already cited the relevant parts of the specifications that show this.

However, XML can have an envelope, such as the HTTP Content-Type header. The W3C specifies that this envelope information takes priority over any declarations in the file itself. So, for example, if you retrieve XML over HTTP, you could potentially get this:

HTTP/1.1 200 OK
Content-Type: text/xml

<root/>

In this case, the XML should be read as US-ASCII, because the default charset for text/* MIME types is ASCII. This is why you should use the application/xml MIME type, which effectively defaults to UTF-8: the "application" prefix means that the relevant application specification defines things like the default encoding (i.e. the XML spec takes over). With text/* MIME types the default is ASCII, and a charset parameter must be included in the MIME type to change it.

Here's another case:

HTTP/1.1 200 OK
Content-Type: text/xml; charset=windows-1252

<?xml version="1.0" encoding="utf-8"?>
<root/>

In this case, a conforming XML processor should read this file as windows-1252, not utf-8.

Another case:

HTTP/1.1 200 OK
Content-Type: application/xml

<?xml version="1.0" encoding="win-1252"?>
<root/>

Here the encoding is windows-1252.

HTTP/1.1 200 OK
Content-Type: application/xml; charset=ascii

<?xml version="1.0" encoding="win-1252"?>
<root/>

Here the encoding is ascii.
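Putting these four cases together, the precedence logic they illustrate can be sketched roughly as follows (my own simplification, not a full RFC 3023 implementation):

import re

def effective_encoding(content_type: str, body: bytes) -> str:
    # An explicit charset parameter in the envelope always wins.
    mime, _, params = content_type.partition(";")
    m = re.search(r"charset=([\w.-]+)", params, re.IGNORECASE)
    if m:
        return m.group(1)
    # text/xml without a charset falls back to US-ASCII.
    if mime.strip().lower() == "text/xml":
        return "us-ascii"
    # application/xml defers to the document's own declaration...
    decl = re.match(rb'<\?xml[^>]*encoding="([^"]+)"', body)
    if decl:
        return decl.group(1).decode("ascii")
    # ...or to the XML default of UTF-8.
    return "utf-8"

print(effective_encoding("text/xml; charset=windows-1252",
                         b'<?xml version="1.0" encoding="utf-8"?><root/>'))
# -> windows-1252, matching the second example above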

Solution 3:

It would not be unreasonable for the second declaration to be rejected if it arrived at the start of a document that had already been detected as being in an encoding incompatible with UTF-8 (such as UTF-16). However, given your statement that the document is UTF-8 encoded, there is no difference in how they would be treated.

An externally-specified encoding would take precedence in both cases; both documents would still be treated identically.

Solution 4:

The way I read the spec, UTF-8 is not the default encoding in an XML declaration. It is only the default encoding "for an entity which begins with neither a Byte Order Mark nor an encoding declaration". If a document is in UTF-16 and has a BOM, it may have an XML declaration without an encoding declaration or no XML declaration at all and still be valid XML.
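For example (my own sketch, not from the original answer), this writes a well-formed UTF-16 document that has a BOM but no encoding declaration:

# Sketch: a UTF-16 document with a BOM but no encoding declaration,
# which the quoted rule still allows ("utf-16" in Python writes the BOM).
xml = '<?xml version="1.0"?>\n<root/>\n'
with open("utf16-bom.xml", "w", encoding="utf-16") as f:
    f.write(xml)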

It is only for documents without a BOM that the two XML declarations you mentioned are equivalent.