Improve Pandas Merge performance

Solution 1:

Setting the index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.

import pandas as pd
import numpy as np

# 1M unique, unsorted ids drawn from a 10M range
myids = np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0, 1000, size=1000000)
# Same ids, in a different order, on the right-hand table
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0, 1000, size=1000000)

%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
#1 loop, best of 3: 664 ms per loop

%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
#1 loop, best of 3: 354 ms per loop

%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
x = df1.join(df2, how='left')
#10 loops, best of 3: 80.4 ms per loop

Even when the join column contains integers that are not in the same order in both tables, you can still expect a speedup of about 8x.
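
As a sanity check (my addition, not part of the original answer), you can confirm that the index-based join returns the same data as the column-based merge; the only difference is that the key ends up in the index rather than in a column. A minimal sketch, assuming the frames above after the in-place set_index calls:

# Rebuild the column-based merge result and compare it to the join;
# reset_index() temporarily restores the key as a column for merge.
merged = (df1.reset_index()
             .merge(df2.reset_index(), how='left', left_on='A', right_on='A2')
             .set_index('A')
             .drop('A2', axis=1))
joined = df1.join(df2, how='left')
assert merged.sort_index().equals(joined.sort_index())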

Solution 2:

I suggest that you set your merge columns as the index and use df1.join(df2) instead of merge; it's much faster.

Here's an example, including profiling:

In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0, 1000, size=1000000)
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0, 1000, size=1000000)

Here's a regular left merge on A and A2:

In [2]: %%timeit
        x = df1.merge(df2, how='left', left_on='A', right_on='A2')

1 loop, best of 3: 441 ms per loop

Here's the same, using join:

In [3]: %%timeit
        x = df1.set_index('A').join(df2.set_index('A2'), how='left')

1 loop, best of 3: 184 ms per loop

Now, obviously, if you can set the index once before looping, the time savings will be much greater:

# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)

CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms

Then, inside the loop, you get a join that is, in this case, 30 times faster:

In [5]: %%timeit
        x = df1.join(df2, how='left')

100 loops, best of 3: 14.3 ms per loop
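
One caveat worth noting (my addition, not part of the answer above): join leaves the merge key in the index. If downstream code expects it as a regular column again, a reset_index() after the join restores it:

# Bring the key back as a column after the index-based join
x = df1.join(df2, how='left').reset_index()   # 'A' is a column again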

Solution 3:

I don't know if this deserves a new answer, but personally, the following tricks helped me improve the joins I had to do on big DataFrames (millions of rows and hundreds of columns) a bit more:

  1. Besides using set_index(index, inplace=True), you may want to sort the index using sort_index(inplace=True). This speeds up the join considerably if your index is not ordered. For example, creating the DataFrames with
import random
import pandas as pd
import numpy as np

nbre_items = 100000

# Use the same ids in both frames, shuffled so neither index is ordered
ids = np.arange(nbre_items)
random.shuffle(ids)

df1 = pd.DataFrame({"id": ids})
df1['value'] = 1
df1.set_index("id", inplace=True)

random.shuffle(ids)

df2 = pd.DataFrame({"id": ids})
df2['value2'] = 2
df2.set_index("id", inplace=True)

I got the following results:

%timeit df1.join(df2)
13.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And after sorting the index (which itself takes only a limited amount of time):

df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
%timeit df1.join(df2)
764 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  2. You can split one of your DataFrames into multiple DataFrames with fewer columns and join them one chunk at a time. This trick gave me mixed results, so be cautious when using it. For example (a quick equivalence check follows the snippet):
# Join df2 onto df1 a block of 100 columns at a time
for i in range(0, df2.shape[1], 100):
    df1 = df1.join(df2.iloc[:, i:min(df2.shape[1], (i + 100))], how='outer')
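
If you use this trick, it's worth verifying that the chunked outer joins reproduce a single direct join. A minimal sketch (my addition; df1_before is a hypothetical copy of df1 taken before the loop, and it assumes the index values are unique in both frames):

# df1_before = df1.copy() must be taken before the chunked loop runs
chunked = df1_before.copy()
for i in range(0, df2.shape[1], 100):
    # iloc slicing clips out-of-range ends, so min() is not strictly required
    chunked = chunked.join(df2.iloc[:, i:i + 100], how='outer')

direct = df1_before.join(df2, how='outer')
assert chunked.equals(direct)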