How do I read a large csv file with pandas?
Solution 1:
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem is to process the CSV in chunks by specifying the chunksize parameter:
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
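What process does is up to you. As a minimal sketch of the idea (assuming, purely for illustration, that the file has a numeric column named value), you can accumulate a running aggregate so that only one chunk is ever held in memory:

total = 0.0
rows = 0
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # each chunk is an ordinary DataFrame with up to chunksize rows
    total += chunk['value'].sum()
    rows += len(chunk)

print('mean of value:', total / rows)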
pandas >= 1.2
read_csv with chunksize returns a context manager, to be used like so:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
See GH38225
Solution 2:
Chunking shouldn't always be the first port of call for this problem.
- Is the file large due to repeated non-numeric data or unwanted columns? If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the usecols parameter of pd.read_csv (see the sketch after this list).
- Does your workflow require slicing, manipulating, exporting? If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
- If all else fails, read line by line via chunks, chunking via pandas or via the csv library as a last resort.
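As a rough sketch of the first point (the file name and column names below are made up purely for illustration): skipping unneeded columns and storing repeated strings as categories can cut memory usage dramatically.

import pandas as pd

df = pd.read_csv(
    'large.csv',
    usecols=['id', 'country', 'amount'],   # read only the columns you actually need
    dtype={'country': 'category'},         # repeated strings are stored once per unique value
)
print(df.memory_usage(deep=True))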
Solution 3:
For large data I recommend the library dask, e.g.:
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
You can read more in the dask documentation.
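Note that dask builds the dataframe lazily; computation only runs when you ask for a result. A small usage sketch (the column name amount is hypothetical, not from the original answer):

# selecting a column and aggregating; .compute() triggers the actual work
mean_amount = df['amount'].mean().compute()
print(mean_amount)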
Another great alternative would be modin, because all the functionality is identical to pandas yet it builds on distributed dataframe libraries such as dask.
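A minimal sketch of that (assuming modin and one of its engines, e.g. Ray or Dask, are installed; only the import line changes compared to plain pandas):

import modin.pandas as pd  # drop-in replacement for the pandas namespace

df = pd.read_csv('large.csv')  # the read is parallelised by the chosen engine
print(df.head())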
Another library that has performed very well in my projects is datatable.
# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")
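If you need pandas functionality after loading (and assuming the loaded data fits in memory), the datatable Frame can be converted:

df_pandas = df.to_pandas()  # convert the datatable Frame to a pandas DataFrame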
Solution 4:
I proceeded like this:
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])

df = pd.DataFrame()
# %time is an IPython/Jupyter magic; it just reports how long the statement takes
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)