How to read Unicode input and compare Unicode strings in Python?

raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

import sys, locale
text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
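
To make the fallback explicit, here is a minimal sketch of the same idea wrapped in two helpers. The names guess_input_encoding and decode_input are made up for illustration, and the code is written so that it also runs on Python 3, where input() already returns text and no decoding is needed:

```python
import locale
import sys


def guess_input_encoding(stream=None):
    """Return the encoding to try for text read from `stream` (default: sys.stdin)."""
    if stream is None:
        stream = sys.stdin
    # stream.encoding may be None (e.g. piped input on Python 2) or missing
    # entirely on file-like objects, so fall back to the locale's preference.
    return getattr(stream, "encoding", None) or locale.getpreferredencoding(True)


def decode_input(raw, encoding=None):
    """Decode a byte string; text that is already unicode is returned unchanged."""
    if not isinstance(raw, bytes):
        return raw
    return raw.decode(encoding or guess_input_encoding())
```

On Python 2 you would call it as decode_input(raw_input()). The guess can still be wrong if the terminal's real encoding differs from what the OS reports, so treat it as a best effort.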

We need more details about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:

>>> a1= u'\xeatre'
>>> a2= u'e\u0302tre'

a1 and a2 are equivalent but not equal:

>>> print a1, a2
être être
>>> print a1 == a2
False

So you might want to use the unicodedata.normalize() function:

>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True

If you give us more information, we might be able to help you more, though.


It should work. raw_input returns a byte string which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OSX:

>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'

>>> uni = bytes.decode('utf-8') # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'

>>> print uni
日本語 Ελληνικά

As for comparing unicode strings: can you post an example where the comparison doesn't work?


I'm not really sure which format you mean by "Unicode format"; there are several. UTF-8? UTF-16? In any case, you should be able to read a normal string with raw_input and then decode it using the string's decode method:

raw = raw_input("Please input some funny characters: ")
decoded = raw.decode("utf-8")

If you have a different input encoding, just use "utf-16" or whatever applies instead of "utf-8". Also see the codecs module's docs for the different kinds of encodings.
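
Picking the right codec matters because the same text becomes different bytes under different encodings. A small sketch of what can go wrong, using the characters "äöü" written as escapes so that the snippet does not depend on the source file's encoding (the variable names are just for illustration):

```python
text = u"\xe4\xf6\xfc"  # the three characters a-umlaut, o-umlaut, u-umlaut

utf8_bytes = text.encode("utf-8")      # two bytes per character here
latin1_bytes = text.encode("latin-1")  # one byte per character

# Decoding with the matching codec gives the original text back.
roundtrip_ok = (utf8_bytes.decode("utf-8") == text
                and latin1_bytes.decode("latin-1") == text)

# Latin-1 bytes are not valid UTF-8 here, so this mismatch raises an error.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    mismatch_detected = True
else:
    mismatch_detected = False

# The opposite mismatch fails silently: every byte is a valid Latin-1
# character, so you get six wrong characters instead of an exception.
mojibake = utf8_bytes.decode("latin-1")
```

The silent case is the more dangerous one: the decode succeeds, but later comparisons against the expected text fail.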

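Comparing should then work just fine with ==. If you have string literals containing special characters, you should prefix them with "u" to mark them as unicode (in a source file, Python 2 also needs an encoding declaration such as "# -*- coding: utf-8 -*-" at the top for non-ASCII literals):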

if decoded == u"äöü":
  print "Do you speak German?"

And if you want to output these strings, you probably want to encode them again in the desired encoding:

print decoded.encode("utf-8")