What is the most efficient way to loop through dataframes with pandas?
Solution 1:
The newest versions of pandas now include a built-in function for iterating over rows.
for index, row in df.iterrows():
# do some logic here
Or, if you want it faster use itertuples()
But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
Solution 2:
Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.
For example, if close
is a 1-d array, and you want the day-over-day percent change,
pct_change = close[1:]/close[:-1]
This computes the entire array of percent changes as one statement, instead of
pct_change = []
for row in close:
pct_change.append(...)
So try to avoid the Python loop for i, row in enumerate(...)
entirely, and
think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.