UTF-16 decoding fails when reading from a CSV
I'm trying to read a CSV that contains some UTF-16-escaped strings. When I print these strings as extracted from the CSV, they don't decode to Cyrillic/Japanese/whatever as they should; instead they print the raw escape sequences. Yet when I copy/paste the strings and print them directly, there's no problem.
import pandas as pd

data = pd.read_csv('stuff.csv')
for index, row in data.iterrows():
    print('\u0423\u043a\u0440\u0430\u0438\u043d\u0430')  # copy/pasted literal
    print(row[1])  # same value, as read from the CSV
outputs:
Украина
\u0423\u043a\u0440\u0430\u0438\u043d\u0430
What am I missing? Note that some of the CSV is ASCII, so I can't just set the encoding to utf-16 for the whole file.
Edit: I'm trying to conditionally decode the strings where UTF-16 escapes are detected. I tried both the string taken from the CSV and the copy/pasted string:
print(bytearray(row[1].encode()).decode('utf-16'))
print(b'\u0423\u043a\u0440\u0430\u0438\u043d\u0430'.decode('utf-16'))
For some reason both decode to Chinese characters:
畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
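Those Chinese characters fall out mechanically, which is the clue: the cell doesn't hold UTF-16 bytes at all, it holds the literal ASCII characters of the escape text (backslash, u, 0, 4, 2, 3, ...). Decoding those ASCII bytes two at a time as UTF-16 code units produces unrelated CJK characters. A minimal sketch (using 'utf-16-le' explicitly to mirror the BOM-less default):

```python
# What the CSV cell actually contains: six literal ASCII characters,
# not an encoded code point.
raw = '\\u0423'
data = raw.encode('ascii')        # b'\\u0423' -> bytes 5C 75 30 34 32 33
# Pairing those bytes as little-endian UTF-16 code units gives
# 0x755C, 0x3430, 0x3332 -- three CJK characters.
print(data.decode('utf-16-le'))   # 畜㐰㌲
```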
Solution 1:
Assuming you actually have literal \u escapes in the file, you can use the Python ast module to get access to the interpreter's actual parser:
from ast import literal_eval
...
print(literal_eval('"'+row[1]+'"'))
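Applied to the conditional-decode goal from the edit, a minimal sketch; the '\\u' substring check is just an illustrative heuristic, not a robust detector:

```python
from ast import literal_eval

def maybe_unescape(s):
    # Illustrative heuristic: only fields containing literal \u escape
    # text are run through the parser; everything else is returned
    # untouched. Note literal_eval will raise on fields that contain
    # an unescaped double quote or a stray backslash sequence.
    if '\\u' in s:
        return literal_eval('"' + s + '"')
    return s

print(maybe_unescape('\\u0423\\u043a\\u0440\\u0430\\u0438\\u043d\\u0430'))  # Украина
print(maybe_unescape('plain ascii'))  # returned unchanged
```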
Solution 2:
pandas.read_csv has an encoding argument. Try:

data = pd.read_csv('stuff.csv', encoding='utf-16')
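Worth noting: if the file really is saved as UTF-16, the ASCII columns are not a problem, since ASCII text is perfectly representable in UTF-16. A self-contained sketch with an in-memory file (the column names are made up for illustration):

```python
import io
import pandas as pd

# Build a UTF-16-encoded CSV in memory; the BOM that .encode('utf-16')
# writes lets the reader pick the right byte order automatically.
payload = 'id,name\n1,Украина\n'.encode('utf-16')
df = pd.read_csv(io.BytesIO(payload), encoding='utf-16')
print(df['name'][0])  # Украина
```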