JSON Specification and usage of BOM/charset-encoding

Solution 1:

You are right:

  1. The BOM character is illegal in JSON (and not needed)
  2. The MIME charset is illegal in JSON (and not needed as well)

RFC 7159, Section 8.1:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

This is put as clearly as it can be. This is the only "MUST NOT" in the entire RFC.

RFC 7159, Section 11:

The MIME media type for JSON text is application/json.
Type name: application
Subtype name: json
Required parameters: n/a
Optional parameters: n/a
[...]
Note: No "charset" parameter is defined for this registration.

JSON encoding

The only valid encodings of JSON are UTF-8, UTF-16 and UTF-32. The first character (or the first two, if there is more than one character) will always have a Unicode value lower than 128, because there is no valid JSON text that can include higher values in the first two characters. It is therefore always possible to know which of the valid encodings and which endianness was used just by looking at the byte stream.

RFC recommendation

The original JSON RFC (RFC 4627, Section 3) says that the first two characters will always be ASCII, so you should check the pattern of NUL bytes in the first 4 bytes.

I would put it differently: since a text as short as 1 (a single digit) is also valid JSON, there is no guarantee that you have two characters at all - let alone 4 bytes.

My recommendation

My recommendation for determining the JSON encoding would be slightly different:

Fast method:

  1. if you have 1 byte and it's not NUL - it's UTF-8
    (actually the only valid character here would be an ASCII digit)
  2. if you have 2 bytes and none of them are NUL - it's UTF-8
    (those must be ASCII digits with no leading '0', {}, [] or "")
  3. if you have 2 bytes and only the first is NUL - it's UTF-16BE
    (it must be an ASCII digit encoded as UTF-16, big endian)
  4. if you have 2 bytes and only the second is NUL - it's UTF-16LE
    (it must be an ASCII digit encoded as UTF-16, little endian)
  5. if you have 3 bytes and they are not NUL - it's UTF-8
    (again, ASCII digits with no leading '0's, "x", [1] etc.)
  6. if you have 4 bytes or more then the RFC method works:
  • 00 00 00 xx - it's UTF-32BE
  • 00 xx 00 xx - it's UTF-16BE
  • xx 00 00 00 - it's UTF-32LE
  • xx 00 xx 00 - it's UTF-16LE
  • xx xx xx xx - it's UTF-8

But this only works if the input is indeed a valid string in one of those encodings, which it may not be. Moreover, even a valid string in one of the 5 valid encodings may still not be valid JSON.
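
The fast method above can be sketched in Python roughly as follows. This is an illustration only: the function name guess_json_encoding is hypothetical, and raising ValueError for empty input or for a NUL pattern that is not in the list is my own assumption. It looks only at NUL bytes, so it has exactly the limitation just described.

```python
def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON byte stream from its NUL-byte pattern.

    Hypothetical helper: it does not check that the bytes are a valid
    string in the guessed encoding, nor that the result is valid JSON.
    """
    n = len(data)
    if n == 0:
        raise ValueError("an empty byte stream is not a JSON text")

    if n >= 4:
        # Step 6: compare the NUL pattern of the first four bytes.
        nul_pattern = tuple(byte == 0 for byte in data[:4])
        known = {
            (True, True, True, False): "utf-32-be",   # 00 00 00 xx
            (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
            (False, True, True, True): "utf-32-le",   # xx 00 00 00
            (False, True, False, True): "utf-16-le",  # xx 00 xx 00
            (False, False, False, False): "utf-8",    # xx xx xx xx
        }
        if nul_pattern in known:
            return known[nul_pattern]
        raise ValueError("NUL pattern matches none of the listed cases")

    # Steps 3 and 4: exactly two bytes, one of which is NUL.
    if n == 2 and data[0] == 0 and data[1] != 0:
        return "utf-16-be"
    if n == 2 and data[0] != 0 and data[1] == 0:
        return "utf-16-le"

    # Steps 1, 2 and 5: one to three bytes without any NUL.
    if 0 not in data:
        return "utf-8"
    raise ValueError("unexpected NUL byte in a short input")
```

With this sketch, guess_json_encoding(b"1") gives "utf-8", and the same digit encoded as UTF-16 gives "utf-16-be" or "utf-16-le" depending on the byte order.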

My recommendation would be to have a slightly more rigid verification than the one included in the RFC to verify that you have:

  1. a valid encoding of either UTF-8, UTF-16 or UTF-32 (LE or BE)
  2. valid JSON

Looking only for NUL bytes is not enough.
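
One way to do this stricter check, sketched in Python under a few assumptions of mine: the helper name parse_json_bytes is hypothetical, the input is accepted only if exactly one of the five encodings both decodes strictly and parses as JSON, and a leading BOM is rejected. Python's json module also accepts NaN and Infinity by default, which are not JSON, so the sketch rejects them through the parse_constant hook.

```python
import json

# The five encodings that are valid for JSON.
CANDIDATES = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def _reject_constant(name):
    # json.loads calls this hook for NaN, Infinity and -Infinity.
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_json_bytes(data: bytes):
    """Return (encoding, value) if exactly one candidate encoding yields
    a strictly decodable string that is also valid JSON.

    Hypothetical helper; raises ValueError otherwise.
    """
    matches = []
    for encoding in CANDIDATES:
        try:
            text = data.decode(encoding)  # strict error handling by default
            if text.startswith("\ufeff"):
                continue  # a BOM is not allowed at the start of a JSON text
            value = json.loads(text, parse_constant=_reject_constant)
        except ValueError:
            # UnicodeDecodeError and JSONDecodeError both derive from ValueError.
            continue
        matches.append((encoding, value))
    if len(matches) != 1:
        raise ValueError("not valid JSON in exactly one of the valid encodings")
    return matches[0]
```

Trying every candidate is slower than looking at NUL bytes, but it never reports an encoding for input that is not actually valid in that encoding.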

That having been said, at no point do you need a BOM to determine the encoding, nor do you need the MIME charset - both are unnecessary and not valid in JSON.

You only have to use the binary content-transfer-encoding when using UTF-16 or UTF-32, because those may contain NUL bytes. UTF-8 doesn't have that problem, so the 8bit content-transfer-encoding is fine for it. UTF-8 still contains bytes >= 128, so 7-bit transfer will not work. UTF-7 would work for such a transfer, but it wouldn't be valid JSON, as it is not one of the valid JSON encodings.

See also this answer for more details.

Answering your followup questions

Are these correct deductions?

Yes.

Will I run into problem when implementing web-services or web-clients which adhere to this interpretations?

Possibly, if you interact with incorrect implementations. Your implementation MAY ignore the BOM for the sake of interoperability with incorrect implementations - see RFC 7159, Section 8.1:

In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Also, ignoring the MIME charset is the expected behavior of compliant JSON implementations - see RFC 7159, Section 11:

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.
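
If you choose the lenient route, it could look roughly like the Python sketch below. The helper names strip_bom and is_json_media_type are hypothetical, and dropping everything after the first ";" in the Content-Type value is a simplification of mine, not a full header parser.

```python
import codecs

# UTF-32 BOMs come first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so the longer marks must be tested first.
_BOMS = (
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
)


def strip_bom(body: bytes) -> bytes:
    """Drop a leading byte order mark, if any, instead of failing on it."""
    for bom in _BOMS:
        if body.startswith(bom):
            return body[len(bom):]
    return body


def is_json_media_type(content_type: str) -> bool:
    """Compare only the media type; any parameters such as charset are ignored."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json"
```

The remaining bytes still have to be checked for a valid encoding and valid JSON; this only removes the two indicators that the RFC says a recipient may ignore.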

Security considerations

I am not personally convinced that silently accepting incorrect JSON streams is always desirable. If you decide to accept input with a BOM and/or a MIME charset then you will have to answer these questions:

  • What to do in case of a mismatch between MIME charset and actual encoding?
  • What to do in case of a mismatch between BOM and MIME charset?
  • What to do in case of a mismatch between BOM and the actual encoding?
  • What to do when all of them differ?
  • What to do with encodings other than UTF-8/16/32?
  • Are you sure that all security checks will work as expected?

Having the encoding defined in three independent places - in the JSON string itself, in the BOM and in the MIME charset - makes the question inevitable: what to do if they disagree? Unless you reject such input, there is no one obvious answer.
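
As an illustration of the "reject on disagreement" policy, here is a small Python sketch. Everything in it is an assumption of mine: the name check_encoding_indicators is hypothetical, the caller is expected to pass the encoding found in the content itself as one of "utf-8", "utf-16-be", "utf-16-le", "utf-32-be" or "utf-32-le", and a label without byte order such as "UTF-16" is treated as agreeing with either endianness.

```python
import codecs


def _agrees(label: str, detected: str) -> bool:
    """True if an encoding label is compatible with the detected encoding."""
    try:
        name = codecs.lookup(label).name  # normalises e.g. "UTF-16LE"
    except LookupError:
        return False  # unknown label: treat as a disagreement
    if name == detected:
        return True
    # Assumption: "utf-16" / "utf-32" without byte order match both variants.
    return name in ("utf-16", "utf-32") and detected.startswith(name + "-")


def check_encoding_indicators(detected, bom_encoding=None, mime_charset=None):
    """Return `detected` if every supplied indicator agrees with it.

    Hypothetical policy helper: raises ValueError on the first mismatch
    instead of guessing which indicator to trust.
    """
    indicators = (("BOM", bom_encoding), ("MIME charset", mime_charset))
    for source, label in indicators:
        if label is not None and not _agrees(label, detected):
            raise ValueError(
                f"{source} says {label!r} but the content is {detected}"
            )
    return detected
```

Rejecting is only one possible answer; the point is that whichever policy you pick has to be an explicit decision.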

For example, if you have code that verifies a JSON string to see if it's safe to eval it in JavaScript, it might be misled by the MIME charset or the BOM, treat the input as a different encoding than it actually is, and fail to detect strings that it would detect if it used the correct encoding. (A similar problem with HTML has led to XSS attacks in the past.)

You have to be prepared for all of those possibilities whenever you decide to accept incorrect JSON strings with multiple and possibly conflicting encoding indicators. It's not to say that you should never do that because you may need to consume input generated by incorrect implementations. I'm just saying that you need to thoroughly consider the implications.

Nonconforming implementations

Should I file bugs against web browsers which violate the two properties above?

Certainly - if they call it JSON and the implementation doesn't conform to the JSON RFC then it is a bug and should be reported as such.

Have you found any specific implementations that don't conform to the JSON specification and yet claim to do so?

Solution 2:

I think you are correct about question 1, due to Section 3 about the first two characters being ASCII and the Unicode FAQ on BOMs; see "Q: How I should deal with BOMs?", answer part 3. Your emphasis on MUST may be a bit strong: the FAQ seems to imply SHOULD.

Don't know the answer to question 2.