Resampling Within a Pandas MultiIndex
Solution 1:
pd.Grouper
allows you to specify a "groupby instruction for a target object". In
particular, you can use it to group by dates even if df.index
is not a DatetimeIndex
:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1
tells pd.Grouper
to look for the dates in the last level of the MultiIndex.
Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
+[pd.Grouper(freq='2D', level=-1)]).sum())
It looks a bit awkward, but using_Grouper
turns out to be much faster than my original
suggestion, using_reset_index
:
import numpy as np
import pandas as pd
import datetime as DT
def using_Grouper(df):
level_values = df.index.get_level_values
return (df.groupby([level_values(i) for i in [0,1]]
+[pd.Grouper(freq='2D', level=-1)]).sum())
def using_reset_index(df):
df = df.reset_index(level=[0, 1])
return df.groupby(['State','City']).resample('2D').sum()
def using_stack(df):
# http://stackoverflow.com/a/15813787/190597
return (df.unstack(level=[0,1])
.resample('2D').sum()
.stack(level=[2,1])
.swaplevel(2,0))
def make_orig():
values_a = range(16)
values_b = range(10, 26)
states = ['Georgia']*8 + ['Alabama']*8
cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
df = pd.DataFrame(
{'value_a': values_a, 'value_b': values_b},
index = [states, cities, dates])
df.index.names = ['State', 'City', 'Date']
return df
def make_df(N):
dates = pd.date_range('2000-1-1', periods=N)
states = np.arange(50)
cities = np.arange(10)
index = pd.MultiIndex.from_product([states, cities, dates],
names=['State', 'City', 'Date'])
df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
columns=['value_a', 'value_b'])
return df
df = make_orig()
print(using_Grouper(df))
yields
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Here is a benchmark comparing using_Grouper
, using_reset_index
, using_stack
on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
Solution 2:
You need the groupby()
method and provide it with a pd.Grouper
for each level of your MultiIndex you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.
To resample date or timestamp levels, you need to set the freq
argument with the frequency of choice — a similar approach using pd.TimeGrouper()
is deprecated in favour of pd.Grouper()
with the freq
argument set.
This should give you the DataFrame you need:
df.groupby([pd.Grouper(level='State'),
pd.Grouper(level='City'),
pd.Grouper(level='Date', freq='2D')]
).sum()
The Time Series Guide in the pandas documentation describes resample()
as:
... a time-based groupby, followed by a reduction method on each of its groups.
Hence, using groupby()
should technically be the same operation as using .resample()
on a DataFrame with a single index.
The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.
Solution 3:
An alternative using stack/unstack
df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
Alabama Mobile 2012-01-01 17 37
Montgomery 2012-01-01 25 45
Georgia Savanna 2012-01-01 9 29
Atlanta 2012-01-03 5 25
Alabama Mobile 2012-01-03 21 41
Montgomery 2012-01-03 29 49
Georgia Savanna 2012-01-03 13 33
Notes:
- No idea about performance comparison
- Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
Solution 4:
This works:
df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
If the Date column is strings, then convert to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])