UTF-16 decoding fails when reading from a CSV
I'm trying to read a CSV that contains some UTF-16-escaped strings. When I print these strings as extracted from the CSV, they don't decode to Cyrillic/Japanese/whatever as they should; instead they print the raw escape sequences. Yet when I copy/paste the strings and print them directly, there's no problem.
import pandas as pd

data = pd.read_csv('stuff.csv')
for index, row in data.iterrows():
    print('\u0423\u043a\u0440\u0430\u0438\u043d\u0430')  # copy/pasted literal
    print(row[1])  # same value, as read from the CSV
outputs:
Украина
\u0423\u043a\u0440\u0430\u0438\u043d\u0430
What am I missing? Note that some of the CSV is ASCII, so I can't just set the encoding to utf-16 for the whole file.
Edit: I'm trying to conditionally decode the strings where UTF-16 escapes are detected. I tried both the string taken from the CSV and the copy/pasted string:
print(bytearray(row[1].encode()).decode('utf-16'))
print(b'\u0423\u043a\u0440\u0430\u0438\u043d\u0430'.decode('utf-16'))
For some reason both decode to Chinese characters:
畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
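Those Chinese characters fall out mechanically, which is the clue: the cell doesn't hold UTF-16 bytes at all, it holds the literal ASCII characters of the escape text (backslash, u, 0, 4, 2, 3, ...). Decoding those ASCII bytes two at a time as UTF-16 code units produces unrelated CJK characters. A minimal sketch (using 'utf-16-le' explicitly to mirror the BOM-less default):

```python
# What the CSV cell actually contains: six literal ASCII characters,
# not an encoded code point.
raw = '\\u0423'
data = raw.encode('ascii')        # b'\\u0423' -> bytes 5C 75 30 34 32 33
# Pairing those bytes as little-endian UTF-16 code units gives
# 0x755C, 0x3430, 0x3332 -- three CJK characters.
print(data.decode('utf-16-le'))   # 畜㐰㌲
```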
Solution 1:
Assuming you actually have literal \u escapes in the file, you can use the Python ast module to get access to the interpreter's actual parser:
from ast import literal_eval
...
print(literal_eval('"'+row[1]+'"'))
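Applied to the conditional-decode goal from the edit, a minimal sketch; the '\\u' substring check is just an illustrative heuristic, not a robust detector:

```python
from ast import literal_eval

def maybe_unescape(s):
    # Illustrative heuristic: only fields containing literal \u escape
    # text are run through the parser; everything else is returned
    # untouched. Note literal_eval will raise on fields that contain
    # an unescaped double quote or a stray backslash sequence.
    if '\\u' in s:
        return literal_eval('"' + s + '"')
    return s

print(maybe_unescape('\\u0423\\u043a\\u0440\\u0430\\u0438\\u043d\\u0430'))  # Украина
print(maybe_unescape('plain ascii'))  # returned unchanged
```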
Solution 2:
pandas.read_csv has an encoding argument. Try:

data = pd.read_csv('stuff.csv', encoding='utf-16')
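Worth noting: if the file really is saved as UTF-16, the ASCII columns are not a problem, since ASCII text is perfectly representable in UTF-16. A self-contained sketch with an in-memory file (the column names are made up for illustration):

```python
import io
import pandas as pd

# Build a UTF-16-encoded CSV in memory; the BOM that .encode('utf-16')
# writes lets the reader pick the right byte order automatically.
payload = 'id,name\n1,Украина\n'.encode('utf-16')
df = pd.read_csv(io.BytesIO(payload), encoding='utf-16')
print(df['name'][0])  # Украина
```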