How do you update the levels of a pandas MultiIndex after slicing its DataFrame?
I have a Dataframe with a pandas MultiIndex:
In [1]: import pandas as pd
In [2]: multi_index = pd.MultiIndex.from_product([['CAN','USA'],['total']],names=['country','sex'])
In [3]: df = pd.DataFrame({'pop':[35,318]},index=multi_index)
In [4]: df
Out[4]:
pop
country sex
CAN total 35
USA total 318
Then I remove some rows from that DataFrame:
In [5]: df = df.query('pop > 100')
In [6]: df
Out[6]:
pop
country sex
USA total 318
But when I consult the MutliIndex, it still has both countries in its levels.
In [7]: df.index.levels[0]
Out[7]: Index([u'CAN', u'USA'], dtype='object')
I can fix this myself in a rather strange way:
In [8]: idx_names = df.index.names
In [9]: df = df.reset_index(drop=False)
In [10]: df = df.set_index(idx_names)
In [11]: df
Out[11]:
pop
country sex
USA total 318
In [12]: df.index.levels[0]
Out[12]: Index([u'USA'], dtype='object')
But this seems rather messy. Is there a better way I'm missing?
From version pandas 0.20.0+
use MultiIndex.remove_unused_levels
:
print (df.index)
MultiIndex(levels=[['CAN', 'USA'], ['total']],
labels=[[1], [0]],
names=['country', 'sex'])
df.index = df.index.remove_unused_levels()
print (df.index)
MultiIndex(levels=[['USA'], ['total']],
labels=[[0], [0]],
names=['country', 'sex'])
This is something that has bitten me before. Dropping columns or rows does NOT change the underlying MultiIndex, for performance and philosophical reasons, and this is officially not considered a bug (read more here). The short answer is that the developers say "that's not what the MultiIndex is for". If you need a list of the contents of a MultiIndex level after modification, for example for iteration or to check to see if something is included, you can use:
df.index.get_level_values(<levelname>)
This returns the current active values within that index level.
So I guess the "trick" here is that the API native way to do it is to use get_level_values instead of just .index or .columns