How to cryptographically hash a JSON object?
Solution 1:
The problem is a common one when computing hashes for any data format where flexibility is allowed. To solve this, you need to canonicalize the representation.
For example, the OAuth1.0a protocol, which is used by Twitter and other services for authentication, requires a secure hash of the request message. To compute the hash, OAuth1.0a says you need to first alphabetize the fields, separate them by newlines, remove the field names (which are well known), and use blank lines for empty values. The signature or hash is computed on the result of that canonicalization.
XML DSIG works the same way - you need to canonicalize the XML before signing it. There is a proposed W3 standard covering this, because it's such a fundamental requirement for signing. Some people call it c14n.
I don't know of a canonicalization standard for json. It's worth researching.
If there isn't one, you can certainly establish a convention for your particular application usage. A reasonable start might be:
- lexicographically sort the properties by name
- double quotes used on all names
- double quotes used on all string values
- no space, or one-space, between names and the colon, and between the colon and the value
- no spaces between values and the following comma
- all other white space collapsed to either a single space or nothing - choose one
- exclude any properties you don't want to sign (one example is, the property that holds the signature itself)
- sign the result, with your chosen algorithm
You may also want to think about how to pass that signature in the JSON object - possibly establish a well-known property name, like "nichols-hmac" or something, that gets the base64 encoded version of the hash. This property would have to be explicitly excluded by the hashing algorithm. Then, any receiver of the JSON would be able to check the hash.
The canonicalized representation does not need to be the representation you pass around in the application. It only needs to be easily produced given an arbitrary JSON object.
Solution 2:
Instead of inventing your own JSON normalization/canonicalization you may want to use bencode. Semantically it's the same as JSON (composition of numbers, strings, lists and dicts), but with the property of unambiguous encoding that is necessary for cryptographic hashing.
bencode is used as a torrent file format, every bittorrent client contains an implementation.
Solution 3:
This is the same issue as causes problems with S/MIME signatures and XML signatures. That is, there are multiple equivalent representations of the data to be signed.
For example in JSON:
{ "Name1": "Value1", "Name2": "Value2" }
vs.
{
"Name1": "Value\u0031",
"Name2": "Value\u0032"
}
Or depending on your application, this may even be equivalent:
{
"Name1": "Value\u0031",
"Name2": "Value\u0032",
"Optional": null
}
Canonicalization could solve that problem, but it's a problem you don't need at all.
The easy solution if you have control over the specification is to wrap the object in some sort of container to protect it from being transformed into an "equivalent" but different representation.
I.e. avoid the problem by not signing the "logical" object but signing a particular serialized representation of it instead.
For example, JSON Objects -> UTF-8 Text -> Bytes. Sign the bytes as bytes, then transmit them as bytes e.g. by base64 encoding. Since you are signing the bytes, differences like whitespace are part of what is signed.
Instead of trying to do this:
{
"JSONContent": { "Name1": "Value1", "Name2": "Value2" },
"Signature": "asdflkajsdrliuejadceaageaetge="
}
Just do this:
{
"Base64JSONContent": "eyAgIk5hbWUxIjogIlZhbHVlMSIsICJOYW1lMiI6ICJWYWx1ZTIiIH0s",
"Signature": "asdflkajsdrliuejadceaageaetge="
}
I.e. don't sign the JSON, sign the bytes of the encoded JSON.
Yes, it means the signature is no longer transparent.
Solution 4:
JSON-LD can do normalitzation.
You will have to define your context.
Solution 5:
RFC 7638: JSON Web Key (JWK) Thumbprint includes a type of canonicalization. Although RFC7638 expects a limited set of members, we would be able to apply the same calculation for any member.
https://www.rfc-editor.org/rfc/rfc7638#section-3