Pandas dataframe read_csv on bad data

Solution 1:

Pass error_bad_lines=False to skip erroneous rows (note: this flag was deprecated in pandas 1.3; see Solution 3 for the modern replacement):

error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with the C parser.)
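A minimal sketch on an in-memory CSV (the file contents here are made up for illustration). Since error_bad_lines was removed in pandas 2.0, the snippet falls back to its replacement on newer versions:

```python
import io
import pandas as pd

# A CSV where line 3 has too many fields (5 instead of the expected 3).
data = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"

# error_bad_lines=False works on pandas < 2.0; on pandas >= 2.0 the
# keyword no longer exists, so fall back to on_bad_lines='skip'.
try:
    df = pd.read_csv(io.StringIO(data), error_bad_lines=False)
except TypeError:
    df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")

print(len(df))  # 2 -- the malformed row was dropped
```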

Solution 2:

To get information about the rows causing errors, use a combination of error_bad_lines=False and warn_bad_lines=True:

import pandas as pd

dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        warn_bad_lines=True, error_bad_lines=False)

error_bad_lines=False skips the error-causing rows, and warn_bad_lines=True prints the error details and row numbers, like this:

'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'

If you want to save the warning messages (e.g. for further processing), you can redirect them to a file with contextlib:

import contextlib
import pandas as pd

# Redirect the parser's stderr output (where the C parser prints
# its "Skipping line ..." messages) into a log file.
with open(r'D:\Temp\log.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                                warn_bad_lines=True, error_bad_lines=False)

Solution 3:

For anyone who, like me, ran across this years after it was posted: the other answers suggest using error_bad_lines=False and warn_bad_lines=True, but both were deprecated in pandas 1.3 and removed in pandas 2.0.

Instead, use on_bad_lines='warn' to achieve the same effect of skipping over bad data lines.

dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        on_bad_lines='warn')

on_bad_lines='warn' raises a warning when a bad line is encountered and skips that line.
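A short sketch on a made-up in-memory CSV. Recent pandas versions emit the "Skipping line ..." message as a ParserWarning, which you can capture with the warnings module; some older versions printed it to stderr instead:

```python
import io
import warnings
import pandas as pd

# In-memory CSV whose third line has 5 fields instead of the expected 3.
data = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"

# Capture any ParserWarning raised for the bad line (on some older
# pandas versions the message goes to stderr instead, so the list
# below may be empty there).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(data), on_bad_lines="warn")

print(len(df))  # 2 -- the malformed row was skipped
for w in caught:
    print(w.message)
```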


Other acceptable values for on_bad_lines are:

  • 'error' (the default), which raises an exception on a bad line
  • 'skip', which silently skips any bad lines
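The other two modes can be sketched on the same kind of made-up in-memory CSV:

```python
import io
import pandas as pd

# Line 3 has too many fields (5 instead of the expected 3).
data = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"

# 'skip': silently drop the malformed row
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(len(df))  # 2

# 'error' (the default): fail fast on the malformed row
try:
    pd.read_csv(io.StringIO(data), on_bad_lines="error")
except pd.errors.ParserError as exc:
    print("parse failed:", exc)
```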