Pandas dataframe read_csv on bad data
Solution 1:
Pass error_bad_lines=False to skip erroneous rows. From the pandas documentation:
error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)
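As a minimal sketch of this behavior (assumptions: the CSV is fed inline via io.StringIO rather than from a file, and since error_bad_lines was removed in pandas 2.0, the sketch falls back to the newer on_bad_lines='skip' there):

```python
import io
import pandas as pd

# header + 3 data rows; the second data row has an extra field
csv_data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# error_bad_lines was removed in pandas 2.0; use on_bad_lines there instead
major = int(pd.__version__.split(".")[0])
if major >= 2:
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="skip")
else:
    df = pd.read_csv(io.StringIO(csv_data), error_bad_lines=False)

print(len(df))  # the malformed row is dropped, leaving 2 rows
```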
Solution 2:
To get information about the rows causing errors, use the combination of error_bad_lines=False and warn_bad_lines=True:
dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        warn_bad_lines=True, error_bad_lines=False)
error_bad_lines=False skips the error-causing rows, and warn_bad_lines=True prints the error details and row numbers, like this:
'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'
If you want to save the warning messages (e.g. for further processing), you can redirect them to a file using contextlib:
import contextlib

with open(r'D:\Temp\log.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                                warn_bad_lines=True, error_bad_lines=False)
Solution 3:
For anyone like me who ran across this years after the original was posted: the other answers suggest using error_bad_lines=False and warn_bad_lines=True, but both were deprecated in pandas 1.3 and removed in pandas 2.0.
Instead, use on_bad_lines='warn' to achieve the same effect and skip over bad data lines.
dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000, on_bad_lines='warn')
on_bad_lines='warn' raises a warning when a bad line is encountered and skips that line.
Other acceptable values for on_bad_lines are:
- 'error', which raises an exception on a bad line
- 'skip', which skips any bad lines
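To illustrate the difference between the two (a sketch using an inline CSV via io.StringIO; assumes pandas >= 1.3, where on_bad_lines exists):

```python
import io
import pandas as pd
from pandas.errors import ParserError

csv_data = "a,b,c\n1,2,3\n4,5,6,7\n"  # second data row has an extra field

# 'skip' silently drops the malformed row
df = pd.read_csv(io.StringIO(csv_data), on_bad_lines='skip')
print(len(df))  # 1 row survives

# 'error' raises ParserError instead of dropping the row
try:
    pd.read_csv(io.StringIO(csv_data), on_bad_lines='error')
except ParserError as exc:
    print('raised:', exc)
```

Since pandas 1.4, on_bad_lines can also be a callable that receives the split fields of each bad line, though only with engine='python'.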