UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

I'm having a few issues trying to encode a string to UTF-8. I've tried numerous things, including using string.encode('utf-8') and unicode(string), but I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)

This is my string:

(。・ω・。)ノ

I don't see what's going wrong, any idea?

Edit: The problem is that printing the string as it is does not show properly. Also, this error when I try to convert it:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)

Solution 1:

This is to do with the encoding of your terminal not being set to UTF-8. Here is my terminal

$ echo $LANG
en_GB.UTF-8
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> 

On my terminal the example works with the above, but if I get rid of the LANG setting then it won't work

$ unset LANG
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
>>> 

Consult the docs for your linux variant to discover how to make this change permanent.

Solution 2:

try:

string.decode('utf-8')  # or:
unicode(string, 'utf-8')

edit:

'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'.decode('utf-8') gives u'(\uff61\uff65\u03c9\uff65\uff61)\uff89', which is correct.

so your problem must be at some oter place, possibly if you try to do something with it were there is an implicit conversion going on (could be printing, writing to a stream...)

to say more we'll need to see some code.

Solution 3:

My +1 to mata's comment at https://stackoverflow.com/a/10561979/1346705 and to the Nick Craig-Wood's demonstration. You have decoded the string correctly. The problem is with the print command as it converts the Unicode string to the console encoding, and the console is not capable to display the string. Try to write the string into a file and look at the result using some decent editor that supports Unicode:

import codecs

s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
s1 = s.decode('utf-8')
f = codecs.open('out.txt', 'w', encoding='utf-8')
f.write(s1)
f.close()

Then you will see (。・ω・。)ノ.

Solution 4:

If you are working on a remote host, look at /etc/ssh/ssh_config on your local PC.

When this file contains a line:

SendEnv LANG LC_*

comment it out with adding # at the head of line. It might help.

With this line, ssh sends language related environment variables of your PC to the remote host. It causes a lot of problems.