Selecting values from dataframe based on multiple column values

I have a dataframe in this format:

ageClass sex nationality treatment unique_id netTime clockTime
0 20 M KEN Treatment 354658649da56c20c72b6689d2b7e1b8cc334ac9 7661 7661
1 20 M KEN Treatment 1da607e762ac07eba6f9b5a717e9ff196d987242 7737 7737
2 20 M KEN Control 1de4a95cef28c290ba5790217288f510afc3b26b 7747 7747
3 30 M KEN Control 12215d93d2cb5b0234991a64d097955338a73dd3 7750 7750
4 30 M KEN Treatment 5375986567be20b49067956e989884908fb807f6 8163 8163
5 20 M ETH Treatment 613be609b3f4a38834c2bc35bffbdb6c47418666 7811 7811
6 20 M KEN Control 70fb3284d112dc27a5cad7f705b38bc91f56ecad 7853 7853
7 30 M AUS Control 0ea5d606a83eb68c89da0a98543b815e383835e3 7902 7902
8 20 M BRA Control ecdd57df778ad901b41e79dd2713a23cb8f13860 7923 7923
9 20 M ETH Control ae7fe893268d86b7a1bdb4489f9a0798797c718c 7927 7927

The objective is to determine which age class benefitted most from being in the treatment group as measured by clocktime. That means i need to somehow group all values for members in each agegroup for both treatment and control conditions and take an average of their clocktimes. Then following that i need to take the difference of the average clocktimes for the subgroups and compare all of these against one another.

Where i am stuck is with filtering the dataframe based on multiple columns simulatneously. I tried using groupby() as follows:

df.groupby(['ageClass','treatment'])['clockTime'].mean()

However I was not able to then calculate the difference in the mean times from the resulting series.

How should I move forward?


You can pivot the table with means you produced

df2 = df.groupby(['ageClass','treatment'])[['clockTime']].mean().reset_index().pivot(columns=['ageClass'], values='clockTime', index='treatment')

ageClass    20  30
treatment       
Control     7862.500000     7826.0
Treatment   7736.333333     8163.0

Then it's easy to find a difference

df2['diff'] = df2[20] - df2[30]

treatment
Control       36.500000
Treatment   -426.666667
Name: diff, dtype: float64

From the groupby you've already done, you can groupby index level 0, i.e. 'ageClass' and then use diff to find the difference between the averages of treatment and control groups for each 'ageClass'. Since diff subtracts the second from the first (and "Control" and "Treatment" are sorted alphabetically), add "-Control" to make it a bit clearer.

s = df.groupby(['ageClass','treatment'])['clockTime'].mean()
out = s.groupby(level=0).diff().dropna().reset_index()
out = out.assign(treatment=out['treatment']+'-Control')

Output:

   ageClass          treatment   clockTime
0        20  Treatment-Control -126.166667
1        30  Treatment-Control  337.000000

From your problem description, I would prescribe ranking. Differences between groups wont tell who benefited the most

s=df.groupby(['ageClass','treatment'])['clockTime'].agg('mean').reset_index()

s['rank']=s.groupby('ageClass')['clockTime'].rank()



    ageClass  treatment    clockTime  rank
0        20    Control  7862.500000   2.0
1        20  Treatment  7736.333333   1.0
2        30    Control  7826.000000   1.0
3        30  Treatment  8163.000000   2.0