pandas.read_csv: how to skip comment lines
I think I misunderstand the intention of read_csv. If I have a file 'j' like
# notes
a,b,c
# more notes
1,2,3
How can I pandas.read_csv this file, skipping any '#' commented lines? I see in the help 'comment' of lines is not supported but it indicates an empty line should be returned. I see an error
df = pandas.read_csv('j', comment='#')
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3
I'm currently on
In [15]: pandas.__version__
Out[15]: '0.12.0rc1'
On version'0.12.0-199-g4c8ad82':
In [43]: df = pandas.read_csv('j', comment='#', header=None)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3
So I believe in the latest releases of pandas (version 0.16.0), you could throw in the comment='#'
parameter into pd.read_csv
and this should skip commented out lines.
These github issues shows that you can do this:
- https://github.com/pydata/pandas/issues/10548
- https://github.com/pydata/pandas/issues/4623
See the documentation on read_csv
: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
One workaround is to specify skiprows to ignore the first few entries:
In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'
In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
Out[12]:
a b c
0 NaN NaN NaN
1 1 2 3
Otherwise read_csv
gets a little confused:
In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
Out[13]:
Unnamed: 0
a b c
NaN NaN NaN
1 2 3
This seems to be the case in 0.12.0, I've filed a bug report.
As Viktor points out you can use dropna to remove the NaN after the fact... (there is a recent open issue to have commented lines be ignored completely):
In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
Out[14]:
a b c
1 1 2 3
Note: the default index will "give away" the fact there was missing data.