How can I partially read a huge CSV file?
I have a very big CSV file, too big to read into memory all at once. I only want to read and process a few lines from it, so I am looking for a function in pandas that can handle this. Plain Python handles it well:
with open('abc.csv') as f:
    line = f.readline()
    # pass until it reaches a particular line number....
However, if I do this in pandas, I always read the first line:
datainput1 = pd.read_csv('matrix.txt',sep=',', header = None, nrows = 1 )
datainput2 = pd.read_csv('matrix.txt',sep=',', header = None, nrows = 1 )
I am looking for an easier way to handle this task in pandas. For example, how can I quickly read only rows 1000 to 2000?
I want to use pandas because I want to read the data into a DataFrame.
Solution 1:
Use chunksize, which makes read_csv return an iterator over DataFrames of that many rows:
for df in pd.read_csv('matrix.txt',sep=',', header = None, chunksize=1):
    pass  # do something with each chunk here
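As a rough sketch of what the loop body might look like, the example below sums one column chunk by chunk, so only a few rows are in memory at a time. The in-memory `csv_text` and the chunk size of 4 are stand-ins for illustration, not part of the original question.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows, 3 columns, no header.
csv_text = "\n".join(f"{i},{i * 2},{i * 3}" for i in range(10))

total = 0
rows_seen = 0
# With chunksize=4, read_csv yields DataFrames of up to 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), sep=",", header=None, chunksize=4):
    total += chunk[1].sum()   # column 1 holds i * 2
    rows_seen += len(chunk)   # the last chunk is shorter (2 rows)

print(rows_seen, total)  # prints: 10 90
```

For a real file, pass the path instead of `io.StringIO(csv_text)` and pick a chunk size that fits comfortably in memory.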
To answer the second part of your question, combine skiprows with chunksize and take only the first chunk:
reader = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, chunksize=1000)
df = next(reader)  # chunksize returns an iterator, so take just the first chunk
This skips the first 1000 rows and then reads only the next 1000, giving you zero-based rows 1000 through 1999. It is unclear whether you need the end points included, but you can adjust the numbers to get what you want.
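To see exactly which rows come back, it can help to check on a small generated file. The sketch below assumes a headerless file whose first column is the zero-based line number (the data is made up for illustration). Note that with chunksize set, read_csv returns an iterator, so the first chunk has to be pulled out explicitly, and further chunks remain available after it.

```python
import io
import pandas as pd

# Hypothetical data: 3000 lines, first column is the zero-based line number.
csv_text = "\n".join(f"{i},{i * i}" for i in range(3000))

reader = pd.read_csv(io.StringIO(csv_text), sep=",", header=None,
                     skiprows=1000, chunksize=1000)

first = next(reader)                 # first chunk after the skipped lines
first_line = int(first[0].iloc[0])   # line number of the first row read
last_line = int(first[0].iloc[-1])   # line number of the last row read

# The iterator is not exhausted: the rest of the file is still pending.
remaining = sum(len(chunk) for chunk in reader)

print(first_line, last_line, len(first), remaining)  # prints: 1000 1999 1000 1000
```

So the first chunk covers lines 1000 through 1999 inclusive; shift skiprows by one if you count rows from 1.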
Solution 2:
In addition to EdChum's answer, I find the nrows argument useful. It simply defines the number of rows you want to import, so instead of an iterator you get a DataFrame containing just nrows rows of the file. It works with skiprows too.
df = pd.read_csv('matrix.txt',sep=',', header = None, skiprows= 1000, nrows=1000)
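If the file has a header row (the file in the question does not, so this is an assumption), passing an integer to skiprows would skip the header as well. A possible variation is to pass a range of line numbers instead, which keeps line 0 as the header. The column names `id` and `value` below are invented for the example.

```python
import io
import pandas as pd

# Hypothetical file with a header line followed by 3000 data rows.
lines = ["id,value"] + [f"{i},{i * 10}" for i in range(3000)]
csv_text = "\n".join(lines)

# skiprows takes zero-based line numbers: skip lines 1..1000 (the first
# 1000 data rows) but keep line 0, which holds the column names.
df = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 1001), nrows=1000)

first_id = int(df["id"].iloc[0])
last_id = int(df["id"].iloc[-1])
print(list(df.columns), first_id, last_id, len(df))
# prints: ['id', 'value'] 1000 1999 1000
```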