What is the difference between join and merge in Pandas?
Suppose I have two DataFrames like so:
left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})
I want to merge them, so I try something like this:
pd.merge(left, right, left_on='key1', right_on='key2')
And I'm happy
key1 lval key2 rval
0 foo 1 foo 4
1 bar 2 bar 5
But I'm trying to use the join method, which I've been lead to believe is pretty similar.
left.join(right, on=['key1', 'key2'])
And I get this:
//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406 if self.right_index:
407 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408 raise AssertionError()
409 self.right_on = [None] * n
410 elif self.right_on is not None:
AssertionError:
What am I missing?
pandas.merge()
is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge()
and pandas.DataFrame.join()
methods as a convenient way to access the capabilities of pandas.merge()
. For example, df1.merge(right=df2, ...)
is equivalent to pandas.merge(left=df1, right=df2, ...)
.
These are the main differences between df.join()
and df.merge()
:
- lookup on right table:
df1.join(df2)
always joins via the index ofdf2
, butdf1.merge(df2)
can join to one or more columns ofdf2
(default) or to the index ofdf2
(withright_index=True
). - lookup on left table: by default,
df1.join(df2)
uses the index ofdf1
anddf1.merge(df2)
uses column(s) ofdf1
. That can be overridden by specifyingdf1.join(df2, on=key_or_keys)
ordf1.merge(df2, left_index=True)
. - left vs inner join:
df1.join(df2)
does a left join by default (keeps all rows ofdf1
), butdf.merge
does an inner join by default (returns only matching rows ofdf1
anddf2
).
So, the generic approach is to use pandas.merge(df1, df2)
or df1.merge(df2)
. But for a number of common situations (keeping all rows of df1
and joining to an index in df2
), you can save some typing by using df1.join(df2)
instead.
Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:
merge
is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.The related
DataFrame.join
method, usesmerge
internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior formerge
). If you are joining on index, you may wish to useDataFrame.join
to save yourself some typing.
...
These two function calls are completely equivalent:
left.join(right, on=key_or_keys) pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
I always use join
on indices:
import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
val_l val_r
key
foo 1 4
bar 2 5
The same functionality can be had by using merge
on the columns follows:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))
key val_l val_r
0 foo 1 4
1 bar 2 5
From this documentation
pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
And :
DataFrame.join
is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:result = pd.merge(left, right, left_index=True, right_index=True, how='outer')