How can I filter lines on load with the pandas read_csv function?
How can I filter which lines of a CSV are loaded into memory by pandas? This seems like an option one should find in read_csv. Am I missing something?
Example: we have a CSV with a timestamp column and we'd like to load just the lines with a timestamp greater than a given constant.
There isn't an option to filter the rows before the CSV file is loaded into a pandas object. You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of the file, e.g.:
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrame chunks
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
You can vary the chunksize to suit your available memory. See the pandas documentation on iterating through files chunk by chunk for more details.
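For completeness, the simpler load-then-filter route mentioned above can be sketched as follows; the CSV contents, column names, and threshold here are made up for illustration:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for 'file.csv' (hypothetical data).
csv_data = io.StringIO("field,value\n1,a\n5,b\n10,c\n")

constant = 4  # hypothetical threshold

# Load everything first, then keep only rows where 'field' exceeds the constant.
df = pd.read_csv(csv_data)
filtered = df[df['field'] > constant]

print(filtered['value'].tolist())
```

This is fine whenever the whole file fits in memory; the chunked variant above only pays off when it does not.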
I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector df[bool_vec]:
filtered = df[(df['timestamp'] > targettime)]
This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. See also this similar question.
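For the comparison to work, the timestamp column has to hold actual datetimes rather than strings. A minimal sketch, with made-up data and column names, using read_csv's parse_dates parameter:

```python
import io

import pandas as pd

# Hypothetical CSV with a timestamp column.
csv_data = io.StringIO(
    "timestamp,reading\n"
    "2021-01-01 00:00:00,1\n"
    "2021-01-02 00:00:00,2\n"
    "2021-01-03 00:00:00,3\n"
)

# parse_dates converts the column to datetimes so the comparison is chronological.
df = pd.read_csv(csv_data, parse_dates=['timestamp'])

targettime = pd.Timestamp('2021-01-01 12:00:00')
filtered = df[df['timestamp'] > targettime]

print(len(filtered))
```

Without parse_dates the comparison would be string-based, which happens to work for ISO-formatted timestamps but fails for most other date formats.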
If the filtered range is contiguous (as it usually is with time/timestamp filters), then the fastest solution is to hard-code the range of rows: simply combine skiprows=range(1, start_row) with the nrows parameter (nrows is the number of rows to read after the skip, i.e. end_row - start_row + 1 for an inclusive end_row). Then the import takes seconds where the accepted solution would take minutes. A few experiments to find the initial start_row are not a huge cost given the savings in import time. Note that the header row is kept by using range(1, ...).
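As a sketch of this technique, with made-up data and hypothetical start_row/end_row values (in practice you would find them by experiment, as described above):

```python
import io

import pandas as pd

# Hypothetical CSV: a header line plus six data rows.
csv_data = io.StringIO(
    "ts,val\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 7))
)

start_row, end_row = 3, 5  # hypothetical, inclusive row range to keep

# range(1, start_row) skips data rows 1..start_row-1 but leaves the header
# (file line 0) intact; nrows then caps how many rows are read after the skip.
df = pd.read_csv(
    csv_data,
    skiprows=range(1, start_row),
    nrows=end_row - start_row + 1,
)

print(df['ts'].tolist())
```

The parser never materializes rows outside the range, which is where the speedup comes from.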
An alternative to the accepted answer is to apply read_csv() to a StringIO built by filtering the input file line by line:

from io import StringIO

import pandas as pd

with open(<file>) as f:
    # each line already ends with '\n', so join with the empty string
    text = "".join(line for line in f if <condition>)
df = pd.read_csv(StringIO(text))

This solution is often faster than the accepted answer when the filtering condition retains only a small portion of the lines.
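A concrete sketch of this approach, with made-up file contents and a filtering condition that keeps a single date prefix; note that the header line must be kept explicitly, since it usually fails the condition:

```python
import io

import pandas as pd

# Hypothetical input file contents (StringIO stands in for open(<file>)).
raw = (
    "timestamp,reading\n"
    "2021-01-02,1\n"
    "2021-01-03,2\n"
    "2021-01-03,3\n"
)

with io.StringIO(raw) as f:
    header = next(f)  # always keep the header line
    # keep only the lines matching the (hypothetical) condition
    text = header + "".join(line for line in f if line.startswith('2021-01-03'))

df = pd.read_csv(io.StringIO(text))
print(len(df))
```

Since the rejected lines are never parsed by the CSV tokenizer, the cost of the scan is a cheap string test per line.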