Improve Row Append Performance On Pandas DataFrames
I also used the DataFrame's append function inside a loop and was perplexed by how slow it ran. Here is a useful example for anyone in the same boat, based on the correct answer on this page.
Python version: 3
Pandas version: 0.20.3
from pandas import DataFrame

# the dictionary to pass to the pandas DataFrame constructor
d = {}
# a counter used as the key (and later the row index) for each entry
i = 0
# example data to loop over and collect into a dataframe
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]
# the loop
for entry in data:
    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}
    # increment the counter
    i = i + 1
# create the dataframe using 'from_dict'
# important to set the 'orient' parameter to "index" so the keys become rows
df = DataFrame.from_dict(d, orient="index")
The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
Appending rows to a list is far more efficient than appending to a DataFrame. Hence you would want to:
- append the rows to a list,
- then convert the list into a DataFrame,
- and set the index as required, as in the sketch below.
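A minimal sketch of that pattern, assuming the incoming records are dicts and that an 'id' field (made up here for illustration) should become the index:
import pandas as pd

rows = []
for record in [{"id": "a", "x": 1}, {"id": "b", "x": 2}]:
    # appending to a plain Python list is cheap; no frame is copied
    rows.append(record)

# build the DataFrame once, then set the index as required
df = pd.DataFrame(rows).set_index("id")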
If you know the shape of the data you are going to receive, I think the best way is to preallocate the DataFrame beforehand.
import numpy as np
import pandas as pd

random_matrix = np.random.randn(100, 100)
insert_df = pd.DataFrame(random_matrix)

# preallocate an empty 200x100 frame up front
df = pd.DataFrame(columns=range(100), index=range(200))
# then fill the preallocated rows in place with .loc
df.loc[range(100), df.columns] = random_matrix
df.loc[range(100, 200), df.columns] = random_matrix
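Roughly speaking, preallocation wins because the frame is allocated once and the .loc assignments write into rows that already exist, whereas every append call builds a brand-new frame and copies all of the existing rows into it.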
This is the pattern that I think makes the most sense. append will be faster if you have a very small dataframe, but it doesn't scale.
In [1]: import numpy as np; import pandas as pd
In [2]: random_matrix = np.random.randn(100, 100)
...: insert_df = pd.DataFrame(random_matrix)
...: df = pd.DataFrame(np.random.randn(100, 100))
In [3]: %timeit df.append(insert_df)
272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit df.loc[range(100), df.columns] = random_matrix
493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit df.loc[range(100), df.columns] = insert_df
821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When we run this with a 100,000-row dataframe, we see much more dramatic results.
In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))
In [2]: %timeit df.append(insert_df)
17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: %timeit df.loc[range(100), df.columns] = random_matrix
465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix
465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df
1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So on the large frame, an append is about 17 times slower than an insert from a DataFrame, and roughly 38 times slower than an insert from a NumPy array. This is because df.append has to copy all 100,000 existing rows into a brand-new frame on every call, while the .loc assignment only writes into the 100 target rows.
Another way is to collect the frames in a list and then use pd.concat.
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def append(df):
    df_out = df.copy()
    for i in range(1000):
        df_out = df_out.append(df)
    return df_out

def concat(df):
    df_list = []
    # 1001 copies: the original plus the 1000 added in append()
    for i in range(1001):
        df_list.append(df)
    return pd.concat(df_list)

# some testing: both approaches produce identical frames
df2 = concat(df)
df3 = append(df)
pd.testing.assert_frame_equal(df2, df3)
%timeit concat(df)
20.2 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit append(df)
275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is now the recommended way to append rows in pandas; the docs say:
"Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." (link)
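Applied to row-by-row collection, a minimal sketch of that advice (the column names and values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({"foo": ["foo_val_0"], "bar": ["bar_val_0"]})

# collect the incoming rows in a plain list...
new_rows = [{"foo": "foo_val_1", "bar": "bar_val_1"},
            {"foo": "foo_val_2", "bar": "bar_val_2"}]

# ...then concatenate them with the original frame all at once
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)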