How to read Unicode input and compare Unicode strings in Python?
raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:
import sys, locale
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
which should work correctly in most cases.
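As a rough sketch of the same idea, the decoding step can be wrapped in a small helper. The name to_unicode and its encoding parameter are made up for illustration and are not part of any library. It assumes Python 2.6 or later (where bytes is an alias for str), and it passes text through untouched so it also runs on Python 3:

```python
import sys
import locale


def to_unicode(raw, encoding=None):
    """Hypothetical helper: decode a byte string read from the terminal."""
    if not isinstance(raw, bytes):
        # Already a text (unicode) object, so there is nothing to decode.
        return raw
    # Prefer an explicit encoding, then the terminal's, then the locale's.
    encoding = (encoding
                or getattr(sys.stdin, 'encoding', None)
                or locale.getpreferredencoding(True))
    return raw.decode(encoding)
```

On Python 2 you would call it as text = to_unicode(raw_input()). Passing the encoding explicitly is still the safest option when you know what the terminal sends.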
We would need more detail about the Unicode comparisons that are not working in order to help you. However, it might be a matter of normalization. Consider the following:
>>> a1= u'\xeatre'
>>> a2= u'e\u0302tre'
a1 and a2 are equivalent but not equal:
>>> print a1, a2
être être
>>> print a1 == a2
False
So you might want to use the unicodedata.normalize() function:
>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True
If you give us more information, we might be able to help you more, though.
It should work. raw_input returns a byte string, which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OS X:
>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'
>>> uni = bytes.decode('utf-8') # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'
>>> print uni
日本語 Ελληνικά
As for comparing unicode strings: can you post an example where the comparison doesn't work?
I'm not really sure which format you mean by "Unicode format"; there are several. UTF-8? UTF-16? In any case, you should be able to read a normal string with raw_input and then decode it using the string's decode method:
raw = raw_input("Please input some funny characters: ")
decoded = raw.decode("utf-8")
If you have a different input encoding, just use "utf-16" or whatever applies instead of "utf-8". Also see the codecs module's docs for the different kinds of encodings.
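If you are not sure which encoding the input uses, one option is to try a few candidates in order. The sketch below is an assumption-laden illustration, not a recommendation: decode_with_fallback and its default candidate list are hypothetical, and since latin-1 accepts any byte sequence, it acts as a catch-all that can silently produce the wrong characters.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Hypothetical helper: try each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
    raise ValueError('none of the candidate encodings could decode the input')
```

Knowing the real encoding of your terminal is always better than guessing; a fallback like this only keeps the program from crashing on unexpected bytes.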
Comparing should then work just fine with ==. If you have string literals containing special characters, you should prefix them with "u" to mark them as unicode:
if decoded == u"äöü":
    print "Do you speak German?"
And if you want to output these strings again, you probably want to encode them again in the desired encoding:
print decoded.encode("utf-8")