What are the 'levels', 'keys', and names arguments for in Pandas' concat function?
Questions
- How do I use
pd.concat
? - What is the
levels
argument for? - What is the
keys
argument for? - Are there a bunch of examples to help explain how to use all the arguments?
Pandas' concat
function is the Swiss Army knife of the merging utilities. The variety of situations in which it is useful are numerous. The existing documentation leaves out a few details on some of the optional arguments. Among them are the levels
and keys
arguments. I set out to figure out what those arguments do.
I'll pose a question that will act as a gateway into many aspects of pd.concat
.
Consider the data frames d1
, d2
, and d3
:
import pandas as pd
d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), [2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), [1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), [1, 3])
If I were to concatenate these together with
pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'])
I get the expected result with a pandas.MultiIndex
for my columns
object:
A B C D
d1 2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
d2 1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
d3 1 0.7 0.8 NaN 0.9
3 0.7 0.8 NaN 0.9
However, I wanted to use the levels
argument documentation:
levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise, they will be inferred from the keys.
So I passed
pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2']])
And get a KeyError
ValueError: Key d3 not in level Index(['d1', 'd2'], dtype='object')
This made sense. The levels I passed were inadequate to describe the necessary levels indicated by the keys. Had I not passed anything, as I did above, the levels are inferred (as stated in the documentation). But how else can I use this argument to better effect?
If I tried this instead:
pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2', 'd3']])
I and got the same results as above. But when I add one more value to the levels,
df = pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2', 'd3', 'd4']])
I end up with the same looking data frame, but the resulting MultiIndex
has an unused level.
df.index.levels[0]
Index(['d1', 'd2', 'd3', 'd4'], dtype='object')
So what is the point of the level
argument and should I be using keys
differently?
I'm using Python 3.6 and Pandas 0.22.
In the process of answering this question for myself, I learned many things, and I wanted to put together a catalog of examples and some explanation.
The specific answer to the point of the levels
argument will come towards the end.
pandas.concat
: The Missing Manual
Link To Current Documentation
Imports and defining objects
import pandas as pd
d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])
s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])
Arguments
objs
The first argument we come across is objs
:
objs: a sequence or mapping of Series, DataFrame, or Panel objects If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised
- We typically see this used with a list of
Series
orDataFrame
objects. - I'll show that
dict
can be very useful as well. - Generators may also be used and can be useful when using
map
as inmap(f, list_of_df)
For now, we'll stick with a list of some of the DataFrame
and Series
objects defined above.
I'll show how dictionaries can be leveraged to give very useful MultiIndex
results later.
pd.concat([d1, d2])
A B C D
2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
axis
The second argument we encounter is axis
whose default value is 0
:
axis: {0/’index’, 1/’columns’}, default 0 The axis to concatenate along.
Two DataFrame
s with axis=0
(stacked)
For values of 0
or index
we mean to say: "Align along the columns and add to the index".
As shown above where we used axis=0
, because 0
is the default value, and we see that the index of d2
extends the index of d1
despite there being overlap of the value 2
:
pd.concat([d1, d2], axis=0)
A B C D
2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
Two DataFrame
s with axis=1
(side by side)
For values 1
or columns
we mean to say: "Align along the index and add to the columns",
pd.concat([d1, d2], axis=1)
A B C B C D
1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN
We can see that the resulting index is the union of indices and the resulting columns are the extension of columns from d1
by the columns of d2
.
Two (or Three) Series
with axis=0
(stacked)
When combining pandas.Series
along axis=0
, we get back a pandas.Series
. The name of the resulting Series
will be None
unless all Series
being combined have the same name. Pay attention to the 'Name: A'
when we print out the resulting Series
. When it isn't present, we can assume the Series
name is None
.
| | | pd.concat(
| pd.concat( | pd.concat( | [s1.rename('A'),
pd.concat( | [s1.rename('A'), | [s1.rename('A'), | s2.rename('B'),
[s1, s2]) | s2]) | s2.rename('A')]) | s3.rename('A')])
-------------- | --------------------- | ---------------------- | ----------------------
2 1 | 2 1 | 2 1 | 2 1
3 2 | 3 2 | 3 2 | 3 2
1 3 | 1 3 | 1 3 | 1 3
2 4 | 2 4 | 2 4 | 2 4
dtype: int64 | dtype: int64 | Name: A, dtype: int64 | 1 5
| | | 3 6
| | | dtype: int64
Two (or Three) Series
with axis=1
(side by side)
When combining pandas.Series
along axis=1
, it is the name
attribute that we refer to in order to infer a column name in the resulting pandas.DataFrame
.
| | pd.concat(
| pd.concat( | [s1.rename('X'),
pd.concat( | [s1.rename('X'), | s2.rename('Y'),
[s1, s2], axis=1) | s2], axis=1) | s3.rename('Z')], axis=1)
---------------------- | --------------------- | ------------------------------
0 1 | X 0 | X Y Z
1 NaN 3.0 | 1 NaN 3.0 | 1 NaN 3.0 5.0
2 1.0 4.0 | 2 1.0 4.0 | 2 1.0 4.0 NaN
3 2.0 NaN | 3 2.0 NaN | 3 2.0 NaN 6.0
Mixed Series
and DataFrame
with axis=0
(stacked)
When performing a concatenation of a Series
and DataFrame
along axis=0
, we convert all Series
to single column DataFrame
s.
Take special note that this is a concatenation along axis=0
; that means extending the index (rows) while aligning the columns. In the examples below, we see the index becomes [2, 3, 2, 3]
which is an indiscriminate appending of indices. The columns do not overlap unless I force the naming of the Series
column with the argument to to_frame
:
pd.concat( |
[s1.to_frame(), d1]) | pd.concat([s1, d1])
------------------------- | ---------------------
0 A B C | 0 A B C
2 1.0 NaN NaN NaN | 2 1.0 NaN NaN NaN
3 2.0 NaN NaN NaN | 3 2.0 NaN NaN NaN
2 NaN 0.1 0.2 0.3 | 2 NaN 0.1 0.2 0.3
3 NaN 0.1 0.2 0.3 | 3 NaN 0.1 0.2 0.3
You can see the results of pd.concat([s1, d1])
are the same as if I had perfromed the to_frame
myself.
However, I can control the name of the resulting column with a parameter to to_frame
. Renaming the Series
with the rename
method does not control the column name in the resulting DataFrame
.
# Effectively renames | |
# `s1` but does not align | # Does not rename. So | # Renames to something
# with columns in `d1` | # Pandas defaults to `0` | # that does align with `d1`
pd.concat( | pd.concat( | pd.concat(
[s1.to_frame('X'), d1]) | [s1.rename('X'), d1]) | [s1.to_frame('B'), d1])
---------------------------- | -------------------------- | ----------------------------
A B C X | 0 A B C | A B C
2 NaN NaN NaN 1.0 | 2 1.0 NaN NaN NaN | 2 NaN 1.0 NaN
3 NaN NaN NaN 2.0 | 3 2.0 NaN NaN NaN | 3 NaN 2.0 NaN
2 0.1 0.2 0.3 NaN | 2 NaN 0.1 0.2 0.3 | 2 0.1 0.2 0.3
3 0.1 0.2 0.3 NaN | 3 NaN 0.1 0.2 0.3 | 3 0.1 0.2 0.3
Mixed Series
and DataFrame
with axis=1
(side by side)
This is fairly intuitive. Series
column name defaults to an enumeration of such Series
objects when a name
attribute is not available.
| pd.concat(
pd.concat( | [s1.rename('X'),
[s1, d1], | s2, s3, d1],
axis=1) | axis=1)
------------------- | -------------------------------
0 A B C | X 0 1 A B C
2 1 0.1 0.2 0.3 | 1 NaN 3.0 5.0 NaN NaN NaN
3 2 0.1 0.2 0.3 | 2 1.0 4.0 NaN 0.1 0.2 0.3
| 3 2.0 NaN 6.0 0.1 0.2 0.3
join
The third argument is join
that describes whether the resulting merge should be an outer merge (default) or an inner merge.
join: {‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis(es).
It turns out, there is no left
or right
option as pd.concat
can handle more than just two objects to merge.
In the case of d1
and d2
, the options look like:
outer
pd.concat([d1, d2], axis=1, join='outer')
A B C B C D
1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN
inner
pd.concat([d1, d2], axis=1, join='inner')
A B C B C D
2 0.1 0.2 0.3 0.4 0.5 0.6
join_axes
Fourth argument is the thing that allows us to do our left
merge and more.
join_axes: list of Index objects
Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic.
Left Merge
pd.concat([d1, d2, d3], axis=1, join_axes=[d1.index])
A B C B C D A B D
2 0.1 0.2 0.3 0.4 0.5 0.6 NaN NaN NaN
3 0.1 0.2 0.3 NaN NaN NaN 0.7 0.8 0.9
Right Merge
pd.concat([d1, d2, d3], axis=1, join_axes=[d3.index])
A B C B C D A B D
1 NaN NaN NaN 0.4 0.5 0.6 0.7 0.8 0.9
3 0.1 0.2 0.3 NaN NaN NaN 0.7 0.8 0.9
ignore_index
ignore_index: boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
Like when I stack d1
on top of d2
, if I don't care about the index values, I could reset them or ignore them.
| pd.concat( | pd.concat(
| [d1, d2], | [d1, d2]
pd.concat([d1, d2]) | ignore_index=True) | ).reset_index(drop=True)
--------------------- | ----------------------- | -------------------------
A B C D | A B C D | A B C D
2 0.1 0.2 0.3 NaN | 0 0.1 0.2 0.3 NaN | 0 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN | 1 0.1 0.2 0.3 NaN | 1 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6 | 2 NaN 0.4 0.5 0.6 | 2 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6 | 3 NaN 0.4 0.5 0.6 | 3 NaN 0.4 0.5 0.6
And when using axis=1
:
| pd.concat(
| [d1, d2], axis=1,
pd.concat([d1, d2], axis=1) | ignore_index=True)
------------------------------- | -------------------------------
A B C B C D | 0 1 2 3 4 5
1 NaN NaN NaN 0.4 0.5 0.6 | 1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6 | 2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN | 3 0.1 0.2 0.3 NaN NaN NaN
keys
We can pass a list of scalar values or tuples in order to assign tuple or scalar values to corresponding MultiIndex. The length of the passed list must be the same length as the number of items being concatenated.
keys: sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level
axis=0
When concatenating Series
objects along axis=0
(extending the index).
Those keys, become a new initial level of a MultiIndex
object in the index attribute.
# length 3 length 3 # length 2 length 2
# /--------\ /-----------\ # /----\ /------\
pd.concat([s1, s2, s3], keys=['A', 'B', 'C']) pd.concat([s1, s2], keys=['A', 'B'])
---------------------------------------------- -------------------------------------
A 2 1 A 2 1
3 2 3 2
B 1 3 B 1 3
2 4 2 4
C 1 5 dtype: int64
3 6
dtype: int64
However, we can use more than scalar values in the keys
argument to create an even deeper MultiIndex
. Here we pass tuples
of length 2 the prepend two new levels of a MultiIndex
:
pd.concat(
[s1, s2, s3],
keys=[('A', 'X'), ('A', 'Y'), ('B', 'X')])
-----------------------------------------------
A X 2 1
3 2
Y 1 3
2 4
B X 1 5
3 6
dtype: int64
axis=1
It's a bit different when extending along columns. When we used axis=0
(see above) our keys
acted as MultiIndex
levels in addition to the existing index. For axis=1
, we are referring to an axis that Series
objects don't have, namely the columns
attribute.
Series
wtih axis=1
Notice that naming the s1
and s2
matters so long as no keys
are passed, but it gets overridden if keys
are passed.
| | | pd.concat(
| pd.concat( | pd.concat( | [s1.rename('U'),
pd.concat( | [s1, s2], | [s1.rename('U'), | s2.rename('V')],
[s1, s2], | axis=1, | s2.rename('V')], | axis=1,
axis=1) | keys=['X', 'Y']) | axis=1) | keys=['X', 'Y'])
-------------- | --------------------- | ---------------------- | ----------------------
0 1 | X Y | U V | X Y
1 NaN 3.0 | 1 NaN 3.0 | 1 NaN 3.0 | 1 NaN 3.0
2 1.0 4.0 | 2 1.0 4.0 | 2 1.0 4.0 | 2 1.0 4.0
3 2.0 NaN | 3 2.0 NaN | 3 2.0 NaN | 3 2.0 NaN
MultiIndex
with Series
and axis=1
pd.concat(
[s1, s2],
axis=1,
keys=[('W', 'X'), ('W', 'Y')])
-----------------------------------
W
X Y
1 NaN 3.0
2 1.0 4.0
3 2.0 NaN
Two DataFrame
with axis=1
As with the axis=0
examples, keys
add levels to a MultiIndex
, but this time to the object stored in the columns
attribute.
pd.concat( | pd.concat(
[d1, d2], | [d1, d2],
axis=1, | axis=1,
keys=['X', 'Y']) | keys=[('First', 'X'), ('Second', 'X')])
------------------------------- | --------------------------------------------
X Y | First Second
A B C B C D | X X
1 NaN NaN NaN 0.4 0.5 0.6 | A B C B C D
2 0.1 0.2 0.3 0.4 0.5 0.6 | 1 NaN NaN NaN 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN | 2 0.1 0.2 0.3 0.4 0.5 0.6
| 3 0.1 0.2 0.3 NaN NaN NaN
Series
and DataFrame
with axis=1
This is tricky. In this case, a scalar key value cannot act as the only level of index for the Series
object when it becomes a column while also acting as the first level of a MultiIndex
for the DataFrame
. So Pandas will again use the name
attribute of the Series
object as the source of the column name.
pd.concat( | pd.concat(
[s1, d1], | [s1.rename('Z'), d1],
axis=1, | axis=1,
keys=['X', 'Y']) | keys=['X', 'Y'])
--------------------- | --------------------------
X Y | X Y
0 A B C | Z A B C
2 1 0.1 0.2 0.3 | 2 1 0.1 0.2 0.3
3 2 0.1 0.2 0.3 | 3 2 0.1 0.2 0.3
Limitations of keys
and MultiIndex
inferrence.
Pandas only seems to infer column names from Series
name, but it will not fill in the blanks when doing an analogous concatenation among data frames with a different number of column levels.
d1_ = pd.concat(
[d1], axis=1,
keys=['One'])
d1_
One
A B C
2 0.1 0.2 0.3
3 0.1 0.2 0.3
Then concatenate this with another data frame with only one level in the columns object and Pandas will refuse to try and make tuples of the MultiIndex
object and combine all data frames as if a single level of objects, scalars and tuples.
pd.concat([d1_, d2], axis=1)
(One, A) (One, B) (One, C) B C D
1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN
Passing a dict
instead of a list
When passing a dictionary, pandas.concat
will use the keys from the dictionary as the keys
parameter.
# axis=0 | # axis=1
pd.concat( | pd.concat(
{0: d1, 1: d2}) | {0: d1, 1: d2}, axis=1)
----------------------- | -------------------------------
A B C D | 0 1
0 2 0.1 0.2 0.3 NaN | A B C B C D
3 0.1 0.2 0.3 NaN | 1 NaN NaN NaN 0.4 0.5 0.6
1 1 NaN 0.4 0.5 0.6 | 2 0.1 0.2 0.3 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6 | 3 0.1 0.2 0.3 NaN NaN NaN
levels
This is used in conjunction with the keys
argument.When levels
is left as its default value of None
, Pandas will take the unique values of each level of the resulting MultiIndex
and use that as the object used in the resulting index.levels
attribute.
levels: list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
If Pandas already infers what these levels should be, what advantage is there to specify it ourselves? I'll show one example and leave it up to you to think up other reasons why this might be useful.
Example
Per the documentation, the levels
argument is a list of sequences. This means that we can use another pandas.Index
as one of those sequences.
Consider the data frame df
that is the concatenation of d1
, d2
and d3
:
df = pd.concat(
[d1, d2, d3], axis=1,
keys=['First', 'Second', 'Fourth'])
df
First Second Fourth
A B C B C D A B D
1 NaN NaN NaN 0.4 0.5 0.6 0.7 0.8 0.9
2 0.1 0.2 0.3 0.4 0.5 0.6 NaN NaN NaN
3 0.1 0.2 0.3 NaN NaN NaN 0.7 0.8 0.9
The levels of the columns object are:
print(df, *df.columns.levels, sep='\n')
Index(['First', 'Second', 'Fourth'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')
If we use sum
within a groupby
we get:
df.groupby(axis=1, level=0).sum()
First Fourth Second
1 0.0 2.4 1.5
2 0.6 0.0 1.5
3 0.6 2.4 0.0
But what if instead of ['First', 'Second', 'Fourth']
there were another missing categories named Third
and Fifth
? And I wanted them included in the results of a groupby
aggregation? We can do this if we had a pandas.CategoricalIndex
. And we can specify that ahead of time with the levels
argument.
So instead, let's define df
as:
cats = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
lvl = pd.CategoricalIndex(cats, categories=cats, ordered=True)
df = pd.concat(
[d1, d2, d3], axis=1,
keys=['First', 'Second', 'Fourth'],
levels=[lvl]
)
df
First Fourth Second
1 0.0 2.4 1.5
2 0.6 0.0 1.5
3 0.6 2.4 0.0
But the first level of the columns object is:
df.columns.levels[0]
CategoricalIndex(
['First', 'Second', 'Third', 'Fourth', 'Fifth'],
categories=['First', 'Second', 'Third', 'Fourth', 'Fifth'],
ordered=True, dtype='category')
And our groupby
summation looks like:
df.groupby(axis=1, level=0).sum()
First Second Third Fourth Fifth
1 0.0 1.5 0.0 2.4 0.0
2 0.6 1.5 0.0 0.0 0.0
3 0.6 0.0 0.0 2.4 0.0
names
This is used to name the levels of a resulting MultiIndex
. The length of the names
list should match the number of levels in the resulting MultiIndex
.
names: list, default None
Names for the levels in the resulting hierarchical index
# axis=0 | # axis=1
pd.concat( | pd.concat(
[d1, d2], | [d1, d2],
keys=[0, 1], | axis=1, keys=[0, 1],
names=['lvl0', 'lvl1']) | names=['lvl0', 'lvl1'])
----------------------------- | ----------------------------------
A B C D | lvl0 0 1
lvl0 lvl1 | lvl1 A B C B C D
0 2 0.1 0.2 0.3 NaN | 1 NaN NaN NaN 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN | 2 0.1 0.2 0.3 0.4 0.5 0.6
1 1 NaN 0.4 0.5 0.6 | 3 0.1 0.2 0.3 NaN NaN NaN
2 NaN 0.4 0.5 0.6 |
verify_integrity
Self explanatory documentation
verify_integrity: boolean, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
Because the resulting index from concatenating d1
and d2
is not unique, it would fail the integrity check.
pd.concat([d1, d2])
A B C D
2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
And
pd.concat([d1, d2], verify_integrity=True)
> ValueError: Indexes have overlapping values: [2]