Split function adds \xef\xbb\xbf and \n to my list
I want to open my file.txt and split all the data in it.
Here is my file.txt:
some_data1 some_data2 some_data3 some_data4 some_data5
And here is my Python code:
>>> file_txt = open("file.txt", 'r')
>>> data = file_txt.read()
>>> data_list = data.split(' ')
>>> print data
some_data1 some_data2 some_data3 some_data4 some_data5
>>> print data_list
['\xef\xbb\xbfsome_data1', 'some_data2', 'some_data3', 'some_data4', 'some_data5\n']
As you can see, when I print data_list, the first item starts with \xef\xbb\xbf and the last item ends with \n. What are these, and how can I remove them from my list?
Thanks.
Solution 1:
Your file begins with a UTF-8 BOM (byte order mark).
To get rid of it, first decode the file contents to unicode with the utf-8-sig codec, which skips a leading BOM.
fp = open("file.txt")
data = fp.read().decode("utf-8-sig").encode("utf-8")
But it is better not to encode it back to utf-8, and to work with the unicode text instead. There is a good rule: decode all your input text data to unicode as soon as possible, and work only with unicode; and encode the output data to the required encoding as late as possible. This will save you from many headaches.
To read bigger files in a certain encoding, use io.open or codecs.open.
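As a rough sketch of the io.open approach, assuming the file is UTF-8 with an optional BOM: the helper name read_words and the temporary demo file are made up for illustration and are not part of the question. The utf-8-sig codec decodes UTF-8 and skips a leading BOM if there is one.

```python
import io
import os
import tempfile


def read_words(path):
    """Read a UTF-8 text file (with or without a BOM) and split it on spaces."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM when present.
    with io.open(path, 'r', encoding='utf-8-sig') as fp:
        text = fp.read()
    # Drop the trailing newline before splitting so the last item stays clean.
    return text.rstrip('\n').split(' ')


# Demo with a hypothetical temporary file that starts with a UTF-8 BOM.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(b'\xef\xbb\xbfsome_data1 some_data2 some_data3\n')

words = read_words(demo_path)
os.remove(demo_path)
print(words)
```

The returned items are unicode strings, so under Python 2 the printed list shows a u prefix on each item.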
Use str.strip() or str.rstrip() to get rid of the newline character \n.
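A small illustration of the stripping step, using a made-up sample line rather than the real file. Two things are easy to miss: rstrip() returns a new string instead of changing the original, and split() with no argument splits on any run of whitespace, which also swallows the trailing newline.

```python
# Hypothetical sample line; note the double space and the trailing newline.
line = u'some_data1 some_data2  some_data3\n'

# Strings are immutable, so keep the value that rstrip() returns.
stripped = line.rstrip('\n')

# Splitting on a single space keeps an empty item for the double space.
by_space = stripped.split(' ')

# Splitting with no argument collapses whitespace runs and ignores the newline.
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```

Which variant fits depends on whether empty fields between consecutive spaces matter for your data.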
Solution 2:
The \xef\xbb\xbf is the Byte Order Mark (BOM) for UTF-8. Each \x is an escape sequence indicating that the next two characters are hex digits giving the value of one byte.
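To make the three bytes concrete, here is a short check using the standard codecs module. The sample byte string is invented for the example. It also shows the difference between the two codecs: plain utf-8 keeps the BOM as the character U+FEFF, while utf-8-sig drops it.

```python
import codecs

# Hypothetical raw bytes, as read from a file saved as "UTF-8 with BOM".
raw = b'\xef\xbb\xbfsome_data1'

# The standard library exposes the same three bytes as a constant.
has_bom = raw.startswith(codecs.BOM_UTF8)

# Plain 'utf-8' decodes the BOM into the character U+FEFF and keeps it.
as_utf8 = raw.decode('utf-8')

# 'utf-8-sig' decodes the same bytes but skips the leading BOM.
as_utf8_sig = raw.decode('utf-8-sig')

print(has_bom)
print(repr(as_utf8))
print(repr(as_utf8_sig))
```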
The \n is a newline character. To remove this, you can use rstrip().
data = data.rstrip()  # rstrip() returns a new string, so assign the result
data_list = data.split(' ')
To remove the byte order mark, you can use io.open (assuming you're using 2.6 or 2.7) to open the file with the utf-8-sig encoding, which skips a leading BOM. Plain utf-8 would decode the BOM to u'\ufeff' and leave it in the data. Note that io.open can be a bit slower in 2.6, where it's implemented in Python; if speed or older versions of Python are necessary, take a look at codecs.open.
Try something like this:
import io
# Make sure we don't lose the list when we close the file
data_list = []
# Use `with` to ensure the file gets cleaned up properly
with io.open('file.txt', 'r', encoding='utf-8-sig') as f:
    data = f.read()  # Be careful when using read() with big files
    data = data.rstrip()  # Chomp the newline; rstrip() returns a new string
    data_list = data.split(' ')
print data_list
Solution 3:
As the others mentioned, you are dealing with a file that has a UTF-8 BOM at its beginning.
They all tell you how to handle it or strip it in code.
But if you only have to work with one static file (or a small static set of them), you may wish to remove the BOM from the file itself so you simply don't have to deal with it.
In fact, most text editors let you convert from one encoding to another, and some list UTF-8 and UTF-8 with BOM separately.
The first that comes to mind (there are many) is Notepad++. Simply go to Encoding > Convert to UTF-8 without BOM, save the file, and you are set.
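If you would rather do the same one-off cleanup from Python than from an editor, something along these lines should work. This is only a sketch: strip_utf8_bom is a made-up helper name, the demo file is a temporary stand-in, and it reads the whole file into memory, so it assumes the file is small.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite the file without a leading UTF-8 BOM; return True if one was removed."""
    with open(path, 'rb') as src:
        raw = src.read()  # whole file in memory; fine for small files
    if not raw.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, leave the file untouched
    with open(path, 'wb') as dst:
        dst.write(raw[len(codecs.BOM_UTF8):])
    return True


# Demo on a hypothetical temporary file that starts with a BOM.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(codecs.BOM_UTF8 + b'some_data1 some_data2\n')

removed = strip_utf8_bom(demo_path)        # first call removes the BOM
removed_again = strip_utf8_bom(demo_path)  # second call finds nothing to remove

with open(demo_path, 'rb') as check:
    remaining = check.read()
os.remove(demo_path)
```

Working on raw bytes avoids any re-encoding, so the rest of the file is left exactly as it was. Keep a backup before running it on a file you care about.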