How do I get str.translate to work with Unicode strings?
I have the following code:
import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to
                                         *len(not_letters_or_digits))
    return to_translate.translate(translate_table)
Which works great for non-unicode strings:
>>> translate_non_alphanumerics('<foo>!')
'_foo__'
But fails for unicode strings:
>>> translate_non_alphanumerics(u'<foo>!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in translate_non_alphanumerics
TypeError: character mapping must return integer, None or unicode
I can't make any sense of the paragraph on "Unicode objects" in the Python 2.6.2 docs for the str.translate() method.
How do I make this work for Unicode strings?
Solution 1:
The Unicode version of translate requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord) to Unicode ordinals. If you want to delete characters, you map to None.
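To make that rule concrete, here is a minimal sketch. The table and the sample text are made up for illustration, and it should run unchanged on Python 2 and on Python 3.3+ (where the u prefix is accepted again):

```python
# Keys are Unicode ordinals; values are ordinals, or None to delete.
table = {
    ord(u'<'): ord(u'_'),   # replace '<' with '_'
    ord(u'>'): ord(u'_'),   # replace '>' with '_'
    ord(u'!'): None,        # delete '!'
}

result = u'<foo>!'.translate(table)
print(result)  # _foo_
```

Characters that have no entry in the table are passed through untouched.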
I changed your function to build a dict mapping the ordinal of every character to be replaced to what you want to translate it to:
def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
    return to_translate.translate(translate_table)
>>> translate_non_alphanumerics(u'<foo>!')
u'_foo__'
Edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:
>>> translate_non_alphanumerics(u'<foo>!', u'bad')
u'badfoobadbad'
Solution 2:
In this version you can map each letter to a different letter, one to one:
def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
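For what it's worth, on Python 3 the same one-to-one table can be built with str.maketrans, which pairs up characters by position and returns a dict keyed by ordinals. This is only a sketch and it assumes Python 3; the question targets Python 2.6, where as far as I know Unicode strings have no such method. The sample input is made up:

```python
def trans(to_translate):
    tabin = 'привет'
    tabout = 'тевирп'
    # str.maketrans pairs characters at the same index and
    # returns a dict mapping ordinals to ordinals.
    translate_table = str.maketrans(tabin, tabout)
    return to_translate.translate(translate_table)

print(trans('привет'))  # тевирп
```

Both strings passed to str.maketrans must have the same length, otherwise it raises ValueError.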
Solution 3:
I came up with the following combination of my original function and Mike's version that works with Unicode and ASCII strings:
import string
def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    if isinstance(to_translate, unicode):
        # Unicode strings need a dict keyed by ordinals.
        translate_table = dict((ord(char), unicode(translate_to))
                               for char in not_letters_or_digits)
    else:
        # Byte strings need the 256-character table from string.maketrans.
        assert isinstance(to_translate, str)
        translate_table = string.maketrans(not_letters_or_digits,
                                           translate_to
                                             *len(not_letters_or_digits))
    return to_translate.translate(translate_table)
Update: "coerced" translate_to to unicode for the unicode translate_table. Thanks Mike.
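If only Python 3 is targeted, where every str is already Unicode, the isinstance branch should not be needed. A minimal sketch under that assumption; it handles text only (not bytes), and the dict.fromkeys construction is my own substitution for the generator expression above:

```python
def translate_non_alphanumerics(to_translate, translate_to='_'):
    # Same character set as above; the backslash is escaped explicitly.
    not_letters_or_digits = '!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # Every listed character's ordinal maps to the same replacement string.
    translate_table = dict.fromkeys(map(ord, not_letters_or_digits),
                                    translate_to)
    return to_translate.translate(translate_table)

print(translate_non_alphanumerics('<foo>!'))         # _foo__
print(translate_non_alphanumerics('<foo>!', 'bad'))  # badfoobadbad
```

Passing translate_to=None deletes the listed characters instead of replacing them.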