Way to use groupby over indices in pandas
Imagine I have a panel in pandas with multiple time_vars for each id_var. For example, imagine the data contain a state variable and a year variable:
                   var1      var2
stcode year
WY     1996-01-01    14  0.000059
       1995-01-01     8  0.000059
       1994-01-01    21  0.000182
       1993-01-01    17  0.000063
       1992-01-01     9  0.000000
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084
Is there a way to use pd.groupby and do operations on the year level of the MultiIndex as the object of the operation, or am I constrained to the columns of the grouped DataFrame?
As an example, suppose I wanted to check whether the dates were sorted, but I only cared about whether the year level was sorted, not the stcode level. I currently implement this like so:
# unsetting year from the index, so that I can groupby the remaining level and
# check if year is monotonically increasing
df.reset_index('year').groupby('stcode').year.is_monotonic_increasing
This returns (as you can see):
stcode
AK True
WY False
Name: year, dtype: bool
However, unsetting the index in this way seems awfully inefficient. Is there a general way to do something like:
df.groupby(level=0)[ -- the other level -- ].apply( -- operation -- )
Similarly, and relatedly, is there a way to sort by the second level and ignore the first? In this example, I may need the time variable sorted within each stcode block to do time-dependent operations like shifting (see the sketch after the next table), but I do not care whether the outer level is sorted alphabetically or numerically. In other words, this would end up like:
                   var1      var2
stcode year
WY     1992-01-01     9  0.000000
       1993-01-01    17  0.000063
       1994-01-01    21  0.000182
       1995-01-01     8  0.000059
       1996-01-01    14  0.000059
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084
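For context, the kind of shift I have in mind is a within-state lag; a minimal sketch (the var1_lag column name is purely illustrative):
# lag var1 by one observation within each stcode block; this is only
# meaningful once the year level is sorted within each block
df['var1_lag'] = df.groupby(level='stcode')['var1'].shift()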
Here is a dict of the original DataFrame:
{'var1': {('WY', Timestamp('1996-01-01 00:00:00')): 14,
('WY', Timestamp('1995-01-01 00:00:00')): 8,
('WY', Timestamp('1994-01-01 00:00:00')): 21,
('WY', Timestamp('1993-01-01 00:00:00')): 17,
('WY', Timestamp('1992-01-01 00:00:00')): 9,
('AK', Timestamp('1964-01-01 00:00:00')): 11,
('AK', Timestamp('1965-01-01 00:00:00')): 6,
('AK', Timestamp('1966-01-01 00:00:00')): 10,
('AK', Timestamp('1967-01-01 00:00:00')): 9,
('AK', Timestamp('1968-01-01 00:00:00')): 9},
'var2': {('WY', Timestamp('1996-01-01 00:00:00')): 5.855486597283743e-05,
('WY', Timestamp('1995-01-01 00:00:00')): 5.91261159570422e-05,
('WY', Timestamp('1994-01-01 00:00:00')): 0.00018243736121803522,
('WY', Timestamp('1993-01-01 00:00:00')): 6.34477473795414e-05,
('WY', Timestamp('1992-01-01 00:00:00')): 0.0,
('AK', Timestamp('1964-01-01 00:00:00')): 0.0002131750516127795,
('AK', Timestamp('1965-01-01 00:00:00')): 0.00010040206689154729,
('AK', Timestamp('1966-01-01 00:00:00')): 0.0001891511055873707,
('AK', Timestamp('1967-01-01 00:00:00')): 0.00026726152282208204,
('AK', Timestamp('1968-01-01 00:00:00')): 8.391729352297261e-05}}
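To rebuild the DataFrame from this dict, a minimal sketch (assuming the dict above is bound to the name d; pandas turns the tuple keys into a MultiIndex, and the level names are set by hand):
import pandas as pd

# tuple keys in the nested dict become a two-level index
df = pd.DataFrame(d)
df.index.names = ['stcode', 'year']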
Solution 1:
I personally find the approach you used clean and reasonable.
However, you could use:
df.groupby(level=0).apply(lambda g: g.index.is_monotonic_increasing)
as the first level is constant within each group and therefore necessarily monotonic increasing, so the check is effectively driven by the year level alone.
Or, to really drop the first level, if needed (not required here):
df.groupby(level=0).apply(lambda g: g.index.droplevel(0).is_monotonic_increasing)
Output:
AK True
WY False
dtype: bool
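If you prefer to address the year level by name rather than by position, an equivalent spelling (a sketch using get_level_values, which reads the second level directly and never touches the first):
df.groupby(level='stcode').apply(
    lambda g: g.index.get_level_values('year').is_monotonic_increasing
)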
Sorting only the second level
Use groupby with group_keys=False, sort=False:
df.groupby(level=0, group_keys=False, sort=False).apply(lambda g: g.sort_index())
Output:
                   var1      var2
stcode year
WY     1992-01-01     9  0.000000
       1993-01-01    17  0.000063
       1994-01-01    21  0.000182
       1995-01-01     8  0.000059
       1996-01-01    14  0.000059
AK     1964-01-01    11  0.000213
       1965-01-01     6  0.000100
       1966-01-01    10  0.000189
       1967-01-01     9  0.000267
       1968-01-01     9  0.000084
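As a quick sanity check, re-running the monotonicity test from above on the sorted result should now return True for every group:
sorted_df = df.groupby(level=0, group_keys=False, sort=False).apply(
    lambda g: g.sort_index()
)
sorted_df.groupby(level=0).apply(lambda g: g.index.is_monotonic_increasing)
# AK    True
# WY    True
# dtype: bool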