How to summarize on different groupby combinations?
Since your data seem to guarantee 3 unique crops per country ("I am compiling a table of top-3 crops by county."), it suffices to sort the values and assign back.
import numpy as np
cols = ['Crop1', 'Crop2', 'Crop3']
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
5 Morrow carrots pears raddish 2500
6 Union carrots pears raddish 2700
7 Lake carrots pears raddish 2000
Then to summarize:
df1.groupby(cols).sum()
# Total_pop
#Crop1 Crop2 Crop3
#apples grain melons 8000
#carrots pears raddish 9200
The benefit is that you avoid Series.apply
or .apply(axis=1)
. For larger DataFrames
, the performance difference is noticeable:
df1 = pd.concat([df1]*10000, ignore_index=True)
cols = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
to_sum = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is one way to do it.
First let's get the unique values across the columns and then reassign these values back to the DataFrame. We will perform this on a copy of the original data since you might need to preserve the original data.
df = df1.copy()
to_sum = ['Crop1', 'Crop2', 'Crop3']
df[to_sum] = pd.DataFrame(df.loc[:, to_sum] \
.apply(set, axis=1) \
.apply(sorted) \
.values \
.tolist(), columns=to_sum)
print(df)
County Crop1 Crop2 Crop3 Total_pop
0 Harney grain apples melons 2000
1 Baker grain apples melons 1500
2 Wheeler grain apples melons 3000
3 Hood River grain apples melons 1500
4 Wasco pears carrots raddish 2000
5 Morrow pears carrots raddish 2500
6 Union pears carrots raddish 2700
7 Lake pears carrots raddish 2000
Now we can perform our groupby
to get the desired results.
df.groupby(to_sum).Total_pop.sum()
Crop1 Crop2 Crop3
apples grain melons 8000
carrots pears raddish 9200
Name: Total_pop, dtype: int64
np.bincount
i, u = pd.factorize([*map(frozenset, zip(df1.Crop1, df1.Crop2, df1.Crop3))])
s = np.bincount(i, df1.Total_pop)
pd.Series(s, u)
(melons, grain, apples) 8000.0
(carrots, raddish, pears) 9200.0
dtype: float64
Or, if you want separate columns
pd.Series(dict(zip(map(tuple, u), s)))
melons grain apples 8000.0
carrots raddish pears 9200.0
dtype: float64
And fully pretty
pd.Series(dict(zip(map(tuple, u), s))) \
.rename_axis(['Crop1', 'Crop2', 'Crop3']).reset_index(name='Total_pop')
Crop1 Crop2 Crop3 Total_pop
0 melons grain apples 8000.0
1 carrots raddish pears 9200.0
Method 1:
Combine the crop
columns
>>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
... x['Crop2'],
... x['Crop3']]),axis=1)
>>> df1.head()
County Crop1 Crop2 Crop3 Total_pop combined_temp
0 Harney grain melons apples 2000 [grain, melons, apples]
1 Baker melons grain apples 1500 [melons, grain, apples]
2 Wheeler melons grain apples 3000 [melons, grain, apples]
3 Hood River apples melons grain 1500 [apples, melons, grain]
4 Wasco pears carrots raddish 2000 [pears, carrots, raddish]
make it a sorted tuple
>>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
>>> df1.head()
County Crop1 Crop2 ... Total_pop combined_temp sorted
0 Harney grain melons ... 2000 [grain, melons, apples] (apples, grain, melons)
1 Baker melons grain ... 1500 [melons, grain, apples] (apples, grain, melons)
2 Wheeler melons grain ... 3000 [melons, grain, apples] (apples, grain, melons)
3 Hood River apples melons ... 1500 [apples, melons, grain] (apples, grain, melons)
4 Wasco pears carrots ... 2000 [pears, carrots, raddish] (carrots, pears, raddish)
then proceed to your normal group by operation
>>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
>>> df1_grouped
sorted Total_pop
0 (apples, grain, melons) 8000
1 (carrots, pears, raddish) 9200
Method 2: A shorted version based on the answer by aws-apprentice
df = df1.copy()
grouping_cols = ['Crop1', 'Crop2', 'Crop3']
df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
.apply(set, axis=1) \
.apply(sorted)
.values \
.tolist(), columns=grouping_cols)
>>> df.head()
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
now take group by group by
>>> df.groupby(grouping_cols).Total_pop.sum()
Crop1 Crop2 Crop3
apples grain melons 8000
carrots pears raddish 9200
Name: Total_pop, dtype: int64
but i personally prefer this answer using numpy