How to group pandas DataFrame entries by date in a non-unique column
I'm using pandas 0.16.2. This has better performance on my large dataset:
data.groupby(data.date.dt.year)
Using the dt accessor and playing around with weekofyear, dayofweek, etc. becomes far easier.
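As a minimal sketch of that idea, the same pattern works with other dt properties such as dayofweek and month. The frame and the column names `date` and `value` below are made up for illustration. As far as I know, recent pandas releases no longer expose weekofyear on the accessor, and `dt.isocalendar().week` is the replacement there.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
data = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-12", "2015-02-02"]),
    "value": [1, 2, 3, 4],
})

# dayofweek numbers Monday as 0 and Sunday as 6
by_weekday = data.groupby(data["date"].dt.dayofweek)["value"].sum()

# month numbers January as 1
by_month = data.groupby(data["date"].dt.month)["value"].sum()

print(by_weekday)
print(by_month)
```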
ecatmur's solution will work fine. This will perform better on large datasets, though:
data.groupby(data['date'].map(lambda x: x.year))
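For what it's worth, here is a small sketch (on a made-up frame, not the asker's data) showing that the map-based key and the dt-based key produce the same groups. It says nothing about relative speed, which you would need to time on your own dataset.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
data = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-06-01", "2015-01-01"]),
    "a": [9, 5, 1],
})

# Each element passed to the lambda is a single Timestamp
years_via_map = data["date"].map(lambda x: x.year)

# Vectorised equivalent through the dt accessor
years_via_dt = data["date"].dt.year

sums_via_map = data.groupby(years_via_map)["a"].sum()
sums_via_dt = data.groupby(years_via_dt)["a"].sum()

print(sums_via_map)
print(sums_via_dt)
```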
This might be easier to explain with a sample dataset.
Create Sample Data
Let's assume we have a single column of Timestamps, date, and another column we would like to perform an aggregation on, a.
import pandas as pd

df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1', '2015-2-1', '2015-3-1']),
                   'a': [9, 5, 1, 2, 3]}, columns=['date', 'a'])
df
date a
0 2012-01-01 9
1 2012-06-01 5
2 2015-01-01 1
3 2015-02-01 2
4 2015-03-01 3
There are several ways to group by year:
- Use the dt accessor with the year property
- Put date in the index and use an anonymous function to access the year
- Use the resample method
- Convert to pandas Period
Use the .dt accessor with the year property
When you have a column (and not an index) of pandas Timestamps, you can access many extra properties and methods with the dt accessor. For instance:
df['date'].dt.year
0 2012
1 2012
2 2015
3 2015
4 2015
Name: date, dtype: int64
We can use this to form our groups and calculate some aggregations on a particular column:
df.groupby(df['date'].dt.year)['a'].agg(['sum', 'mean', 'max'])
sum mean max
date
2012 14 7 9
2015 6 2 3
Put date in the index and use an anonymous function to access the year
If you set the date column as the index, it becomes a DatetimeIndex with the same properties and methods that the dt accessor gives normal columns.
df1 = df.set_index('date')
df1.index.year
Int64Index([2012, 2012, 2015, 2015, 2015], dtype='int64', name='date')
Interestingly, when using the groupby method, you can pass it a function. This function is called on each value of the DataFrame's index, and its return value becomes the group key. So, we can get the same result as above with the following:
df1.groupby(lambda x: x.year)['a'].agg(['sum', 'mean', 'max'])
sum mean max
2012 14 7 9
2015 6 2 3
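To make it clearer what the function receives, here is a small sketch using a named function instead of a lambda. The helper name `year_of` is made up for illustration. Grouping by `df1.index.year` directly gives the same keys without calling a Python function per label.

```python
import pandas as pd

# Same sample values as above, with date already in the index
df1 = pd.DataFrame(
    {"a": [9, 5, 1, 2, 3]},
    index=pd.DatetimeIndex(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"],
        name="date",
    ),
)

def year_of(label):
    # label is one index value at a time (a Timestamp here)
    return label.year

by_function = df1.groupby(year_of)["a"].sum()

# Equivalent grouping using the index attribute directly
by_attribute = df1.groupby(df1.index.year)["a"].sum()

print(by_function)
print(by_attribute)
```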
Use the resample method
If your date column is not in the index, you must specify the column with the on parameter. You also need to specify the offset alias as a string.
df.resample('AS', on='date')['a'].agg(['sum', 'mean', 'max'])
sum mean max
date
2012-01-01 14.0 7.0 9.0
2013-01-01 NaN NaN NaN
2014-01-01 NaN NaN NaN
2015-01-01 6.0 2.0 3.0
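Unlike groupby, resample emits a row for every year in the range, including years with no data. A minimal sketch of dropping those empty bins follows. It uses the 'YS' (year start) alias, which I believe newer pandas versions prefer over 'AS', so treat the alias choice as an assumption about your installed version.

```python
import pandas as pd

# Same sample values as above
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"]
    ),
    "a": [9, 5, 1, 2, 3],
})

# One bin per calendar year from 2012 through 2015, empty years included
yearly_mean = df.resample("YS", on="date")["a"].mean()

# The mean of an empty bin is NaN, so dropna keeps only years that had data
non_empty = yearly_mean.dropna()

print(yearly_mean)
print(non_empty)
```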
Convert to pandas Period
You can also convert the date column to a pandas Period object. We must pass in the offset alias as a string to determine the length of the Period.
df['date'].dt.to_period('Y')
0 2012
1 2012
2 2015
3 2015
4 2015
Name: date, dtype: object
We can then use this as the grouping key:
df.groupby(df['date'].dt.to_period('Y'))['a'].agg(['sum', 'mean', 'max'])
sum mean max
2012 14 7 9
2015 6 2 3
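The same approach extends to other period lengths by changing the alias. As a sketch, assuming quarterly grouping is what you want, 'Q' gives one group per calendar quarter:

```python
import pandas as pd

# Same sample values as above
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"]
    ),
    "a": [9, 5, 1, 2, 3],
})

# Convert each Timestamp to the calendar quarter it falls in
quarters = df["date"].dt.to_period("Q")

# Only quarters that actually contain rows appear in the result
by_quarter = df.groupby(quarters)["a"].sum()

print(by_quarter)
```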
This should work:
data.groupby(lambda x: data['date'][x].year)