UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]
Known encoding
If you know the encoding of the file you want to read in, you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try to use chardet, however this is not guaranteed to work. It is more a guess work.
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
result = chardet.detect(f.read()) # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv()
function, I see it has an encoding
parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252"
to your read_csv()
call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r
in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv
to new-ss.csv
, where the string D:\new-ss.csv
would be read as D
, :
, newline character, e
, w
, etc.
Anyway, try that encoding parameter on your first read_csv()
call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
One simple solution is you can open the csv file in an editor like Sublime Text and save it with 'utf-8' encoding. Then we can easily read the file through pandas.