Why does concatenation of DataFrames get exponentially slower?

Solution 1:

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|-----------|---------------------|-----------|------------------|
|         0 |                   0 |         1 |                1 |
|         1 |                   1 |         1 |                2 |
|         2 |                   2 |         1 |                3 |
|       ... |                 ... |       ... |              ... |
|       N-1 |                 N-1 |         1 |                N |

Since 1 + 2 + 3 + ... + N = N(N+1)/2, completing the loop requires O(N**2) copies.
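To see the difference concretely, here is a minimal timing sketch (not part of the original code; the 2000 one-row chunks and the time.perf_counter measurements are just illustrative choices). The in-loop version should slow down sharply as the number of chunks grows, while the single concat stays fast:

import time
import pandas as pd

# Build N one-row DataFrames to stand in for the preprocessed chunks.
chunks = [pd.DataFrame({"a": [i]}) for i in range(2000)]

# Quadratic: concatenate inside the loop, re-copying the growing result each time.
start = time.perf_counter()
result = chunks[0]
for chunk in chunks[1:]:
    result = pd.concat([result, chunk], axis=0)
print("concat inside loop:", time.perf_counter() - start, "seconds")

# Linear: one concat over the whole list, so each row is copied only once.
start = time.perf_counter()
result = pd.concat(chunks, axis=0)
print("single concat:     ", time.perf_counter() - start, "seconds")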

Now consider

super_x = []
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)

Appending to a list is an amortized O(1) operation and does not copy the existing elements. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
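One detail worth noting (an addition, not part of the original answer): each chunk keeps its own index, so the concatenated result may contain duplicate index labels. If you want a fresh 0..N-1 index, pass ignore_index=True to the final concat:

super_x = pd.concat(super_x, axis=0, ignore_index=True)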

Solution 2:

Every time you concatenate, you make a copy of the data.

You want to keep a list of your chunks, and then concatenate everything as the final step.

df_x = []
df_y = []
for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    df_x.append(x)
    df_y.append(y)

super_x = pd.concat(df_x, axis=0)
del df_x  # Free up memory.
super_y = pd.concat(df_y, axis=0)
del df_y  # Free up memory.