How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
I need to create a data frame by reading data from a file with the read_csv method. However, the separators are irregular: some columns are separated by tabs (\t), others by spaces. Moreover, some columns can be separated by two, three, or more spaces, or even by a combination of spaces and tabs (for example, three spaces, two tabs, and then one space).
Is there a way to tell pandas to treat these files properly?
By the way, I do not have this problem in plain Python. I use:
for line in open(file_name):
    fld = line.split()
And it works perfectly. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?
From the documentation, you can use either a regex or delim_whitespace (note that delim_whitespace is deprecated since pandas 2.2 in favor of sep=r"\s+"):
>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
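For readers who want to try this without creating a file, here is a minimal sketch. The sample string `raw` is hypothetical (it is not the original whitespace.csv) and is fed to read_csv through io.StringIO.

```python
import io

import pandas as pd

# Hypothetical sample data: fields separated by irregular runs of spaces and tabs.
raw = "a\t  b \t c   1\t\t2\nd \te\t f 3  \t 4\n"

# sep=r"\s+" treats any run of whitespace as a single delimiter.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
print(df)
```

Each row should come out with five fields, regardless of how many spaces or tabs sit between them.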
>>> pd.read_csv("whitespace.csv", header=None, sep=r"\s+|\t+|\s+\t+|\t+\s+")
would use any combination of any number of spaces and tabs as the separator. Since \s already matches tabs, this pattern is equivalent to the shorter \s+.
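As a quick check of the regex semantics only, the sketch below splits a hypothetical line with both patterns using the standard re module. It does not call pandas, so it illustrates how the patterns match rather than how read_csv parses.

```python
import re

# Hypothetical line mixing spaces and tabs between fields.
line = "a   \t\tb \t c\t1  2"

long_pattern = r"\s+|\t+|\s+\t+|\t+\s+"
short_pattern = r"\s+"

# The first alternative, \s+, already consumes a whole run of spaces and tabs,
# so the remaining alternatives never change the result.
long_fields = re.split(long_pattern, line)
short_fields = re.split(short_pattern, line)

print(long_fields)
print(short_fields)
```

Both calls produce the same list of fields for this input.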
Pandas has two CSV readers, and only one is flexible regarding redundant leading white space:
pd.read_csv("whitespace.csv", skipinitialspace=True)
while the other is not (it was deprecated in pandas 0.21 and removed in 1.0):
pd.DataFrame.from_csv("whitespace.csv")
Neither is flexible out of the box regarding trailing white space; see the answers with regular expressions. Avoid delim_whitespace, as it also allows plain spaces (without , or \t) as separators.
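To illustrate what skipinitialspace changes, here is a small sketch on hypothetical comma-separated data where each comma is followed by extra spaces. The variable names `plain` and `stripped` are only for illustration.

```python
import io

import pandas as pd

# Hypothetical data: commas followed by a varying number of spaces.
raw = "x,  y,   z\n1,  2,   3\n"

# Default behaviour: spaces after the delimiter stay part of the field.
plain = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops spaces that directly follow the delimiter.
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(plain.columns))
print(list(stripped.columns))
```

Comparing the two column lists shows the leading spaces that survive in the default case.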
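To handle every combination, we can allow zero or more spaces or tabs on either side of the comma. A regex separator other than \s+ is handled by the Python parsing engine, so passing engine="python" avoids a ParserWarning:
pd.read_csv("whitespace.csv", header=None, sep=r"[ \t]*,[ \t]*", engine="python")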
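A self-contained sketch of this pattern, assuming hypothetical data with stray spaces and tabs around the commas (the string `raw` is made up for the example):

```python
import io

import pandas as pd

# Hypothetical data: commas surrounded by stray spaces and tabs.
raw = "a ,\tb,  c\n1 , 2\t,3\n"

# The separator is a comma plus any spaces or tabs on either side of it.
# engine="python" is requested explicitly because this is a general regex.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep=r"[ \t]*,[ \t]*",
    engine="python",
)
print(df)
```

Because the first row holds letters, every column is read as text here, and the surrounding whitespace is consumed by the separator instead of ending up in the fields.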