How do I compare a Unicode string that has different bytes, but the same value?

I'm comparing Unicode strings between JSON objects.

They have the same value:

a = '人口じんこうに膾炙かいしゃする'
b = '人口じんこうに膾炙かいしゃする'

But they have different Unicode representations:

String a : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
String b : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'

How can I compare two Unicode strings by their value?


Solution 1:

Unicode normalization will get you there for this one:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True

Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.
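For example, a short sketch applying the same idea to the a and b strings from the question:

>>> import unicodedata
>>> a = '\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
>>> b = '\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'
>>> a == b  # the raw code points differ
False
>>> unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
True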

Character U+F9FB is a "CJK Compatibility" character. These characters decompose into one or more regular CJK characters when normalized.
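You can inspect this yourself with unicodedata, which reports the character's name and its canonical decomposition:

>>> import unicodedata
>>> unicodedata.name("\uf9fb")
'CJK COMPATIBILITY IDEOGRAPH-F9FB'
>>> unicodedata.decomposition("\uf9fb")
'7099'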

Solution 2:

Character U+F9FB (炙) is a CJK Compatibility Ideograph. These characters are distinct code points from the regular CJK characters, but they decompose into one or more regular CJK characters when normalized.

Unicode has an official string collation algorithm, the Unicode Collation Algorithm (UCA), designed for exactly this purpose. Python does not come with UCA support as of 3.7,* but there are third-party libraries like pyuca:

>>> from pyuca import Collator
>>> ck = Collator().sort_key
>>> ck(a) == ck(b)
True
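pyuca is a pure-Python implementation of the UCA and is available from PyPI (pip install pyuca).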

For this case—and many others, but definitely not all—picking the appropriate normalization to apply to both strings before comparing will work, and it has the advantage of support built into the stdlib.
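To illustrate what "picking the appropriate normalization" means (this example is not from the question): some look-alike characters, such as U+2460 CIRCLED DIGIT ONE, are only compatibility-equivalent rather than canonically equivalent, so they match only under the K (compatibility) forms:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\u2460") == "1"   # canonical forms keep them distinct
False
>>> unicodedata.normalize("NFKC", "\u2460") == "1"  # compatibility forms fold them together
True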

* The idea has been accepted in principle since 3.4, but nobody has written an implementation—in part because most of the core devs who care are using pyuca or one of the two ICU bindings, which have the advantage of working in current and older versions of Python.