Difference(s) between merge() and concat() in pandas
What's the essential difference(s) between pd.DataFrame.merge()
and pd.concat()
?
So far, this is what I found, please comment on how complete and accurate my understanding is:
.merge()
can only use columns (plus row-indices) and it is semantically suitable for database-style operations..concat()
can be used with either axis, using only indices, and gives the option for adding a hierarchical index.Incidentally, this allows for the following redundancy: both can combine two dataframes using the rows indices.
pd.DataFrame.join()
merely offers a shorthand for a subset of the use cases of.merge()
(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )
A very high level difference is that merge()
is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True
and/or right_index=True
), and concat()
is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis
option is set to 0 or 1).
join()
is used to merge 2 dataframes on the basis of the index; instead of using merge()
with the option left_index=True
we can use join()
.
For example:
df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df1:
Key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})
df2:
Key data2
0 a 0
1 b 1
2 d 2
#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is
# a common column in 2 dataframes
pd.merge(df1, df2)
Key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
#Concat
# df2 dataframe is appended at the bottom of df1
pd.concat([df1, df2])
Key data1 data2
0 b 0 NaN
1 b 1 NaN
2 a 2 NaN
3 c 3 NaN
4 a 4 NaN
5 a 5 NaN
6 b 6 NaN
0 a Nan 0
1 b Nan 1
2 d Nan 2
At a high level:
-
.concat()
simply stacks multipleDataFrame
together either vertically, or stitches horizontally after aligning on index -
.merge()
first aligns twoDataFrame
' selected common column(s) or index, and then pick up the remaining columns from the aligned rows of eachDataFrame
.
More specifically, .concat()
:
- Is a top-level pandas function
- Combines two or more pandas
DataFrame
vertically or horizontally - Aligns only on the index when combining horizontally
- Errors when any of the
DataFrame
contains a duplicate index. - Defaults to outer join with the option for inner join
And .merge()
:
- Exists both as a top-level pandas function and a
DataFrame
method (as of pandas 1.0) - Combines exactly two
DataFrame
horizontally - Aligns the calling
DataFrame
's column(s) or index with the otherDataFrame
's column(s) or index - Handles duplicate values on the joining columns or index by performing a cartesian product
- Defaults to inner join with options for left, outer, and right
Note that when performing pd.merge(left, right)
, if left
has two rows containing the same values from the joining columns or index, each row will combine with right
's corresponding row(s) resulting in a cartesian product. On the other hand, if .concat()
is used to combine columns, we need to make sure no duplicated index exists in either DataFrame
.
Practically speaking:
- Consider
.concat()
first when combining homogeneousDataFrame
, while consider.merge()
first when combining complementaryDataFrame
. - If need to merge vertically, go with
.concat()
. If need to merge horizontally via columns, go with.merge()
, which by default merge on the columns in common.
Reference: Pandas 1.x Cookbook
pd.concat
takes an Iterable
as its argument. Hence, it cannot take DataFrame
s directly as its argument. Also Dimension
s of the DataFrame
should match along axis while concatenating.
pd.merge
can take DataFrame
s as its argument, and is used to combine two DataFrame
s with same columns or index, which can't be done with pd.concat
since it will show the repeated column in the DataFrame.
Whereas join can be used to join two DataFrame
s with different indices.
I am currently trying to understand the essential difference(s) between
pd.DataFrame.merge()
andpd.concat()
.
Nice question. The main difference:
pd.concat
works on both axes.
The other difference, is pd.concat
has innerdefault and outer joins only, while pd.DataFrame.merge()
has left, right, outer, innerdefault joins.
Third notable other difference is: pd.DataFrame.merge()
has the option to set the column suffixes when merging columns with the same name, while for pd.concat
this is not possible.
With pd.concat
by default you are able to stack rows of multiple dataframes (axis=0
) and when you set the axis=1
then you mimic the pd.DataFrame.merge()
function.
Some useful examples of pd.concat
:
df2=pd.concat([df]*2, ignore_index=True) #double the rows of a dataframe
df2=pd.concat([df, df.iloc[[0]]]) # add first row to the end
df3=pd.concat([df1,df2], join='inner', ignore_index=True) # concat two df's
The main difference between merge & concat is that merge allow you to perform more structured "join" of tables where use of concat is more broad and less structured.
Merge
Referring the documentation, pd.DataFrame.merge
takes right as a required argument, which you can think it as joining left table and right table according to some pre-defined structured join operation. Note the definition for parameter right.
Required Parameters
- right: DataFrame or named Series
Optional Parameters
- how: {‘left’, ‘right’, ‘outer’, ‘inner’} default ‘inner’
- on: label or list
- left_on: label or list, or array-like
- right_on: label or list, or array-like
- left_index: bool, default False
- right_index: bool, default False
- sort: bool, default False
- suffixes: tuple of (str, str), default (‘_x’, ‘_y’)
- copy: bool, default True
- indicator: bool or str, default False
- validate: str, optional
Important: pd.DataFrame.merge
requires right to be a pd.DataFrame
or named pd.Series
object.
Output
- Returns: DataFrame
Furthermore, if we check the docstring for Merge Operation on pandas is below:
Perform a database (SQL) merge operation between two DataFrame or Series objects using either columns as keys or their row indexes
Concat
Refer to documentation of pd.concat
, first note that the parameter is not named any of table, data_frame, series, matrix, etc., but objs instead. That is, you can pass many "data containers", which are defined as:
Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]
Required Parameters
- objs: a sequence or mapping of Series or DataFrame objects
Optional Parameters
- axis: {0/’index’, 1/’columns’}, default 0
- join: {‘inner’, ‘outer’}, default ‘outer’
- ignore_index: bool, default False
- keys: sequence, default None
- levels: list of sequences, default None
- names: list, default None
- verify_integrity: bool, default False
- sort: bool, default False
- copy: bool, default True
Output
- Returns: object, type of objs
Example
Code
import pandas as pd
v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])
df_left = pd.DataFrame({
"v1": v1,
"v2": v2,
"v3": v3
})
df_right = pd.DataFrame({
"v4": [5, 5, 5, 5],
"v5": [3, 2, 1, 0]
})
df_concat = pd.concat([v1, v2, v3])
# Performing operations on default
merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('='*20)
print(concat_result)
Code Output
v1 v2 v3 v4 v5
0 1 10 0 5 3
1 5 100 1 5 2
2 9 1000 2 5 1
3 13 10000 3 5 0
====================
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
You can achieve, however, the first output (merge) with concat by changing the axis parameter
concat_result = pd.concat([df_left, df_right], sort=False, axis=1)
Observe the following behavior,
concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)
outputs;
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
, which you cannot perform a similar operation with merge, since it only allows a single DataFrame or named Series.
merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)
outputs;
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
Conclusion
As you may have notice already that input and outputs may be different between "merge" and "concat".
As I mentioned at the beginning, the very first (main) difference is that "merge" performs a more structured join with a set of restricted set of objects and parameters where as "concat" performs a less strict/broader join with a broader set of objects and parameters.
All in all, merge is less tolerant to changes/(the input) and "concat" is looser/less sensitive to changes/(the input). You can achieve "merge" by using "concat", but the reverse is not always true.
"Merge" operation uses Data Frame columns (or name of pd.Series
object) or row indices, and since it uses those entities only it performs horizontal merge of Data Frames or Series, and does not apply vertical operation as a result.
If you want to see more, you can deep dive in the source code a bit;
- Source code for concat
- Source code for merge