Selecting values from dataframe based on multiple column values
I have a dataframe in this format:
ageClass | sex | nationality | treatment | unique_id | netTime | clockTime | |
---|---|---|---|---|---|---|---|
0 | 20 | M | KEN | Treatment | 354658649da56c20c72b6689d2b7e1b8cc334ac9 | 7661 | 7661 |
1 | 20 | M | KEN | Treatment | 1da607e762ac07eba6f9b5a717e9ff196d987242 | 7737 | 7737 |
2 | 20 | M | KEN | Control | 1de4a95cef28c290ba5790217288f510afc3b26b | 7747 | 7747 |
3 | 30 | M | KEN | Control | 12215d93d2cb5b0234991a64d097955338a73dd3 | 7750 | 7750 |
4 | 30 | M | KEN | Treatment | 5375986567be20b49067956e989884908fb807f6 | 8163 | 8163 |
5 | 20 | M | ETH | Treatment | 613be609b3f4a38834c2bc35bffbdb6c47418666 | 7811 | 7811 |
6 | 20 | M | KEN | Control | 70fb3284d112dc27a5cad7f705b38bc91f56ecad | 7853 | 7853 |
7 | 30 | M | AUS | Control | 0ea5d606a83eb68c89da0a98543b815e383835e3 | 7902 | 7902 |
8 | 20 | M | BRA | Control | ecdd57df778ad901b41e79dd2713a23cb8f13860 | 7923 | 7923 |
9 | 20 | M | ETH | Control | ae7fe893268d86b7a1bdb4489f9a0798797c718c | 7927 | 7927 |
The objective is to determine which age class benefitted most from being in the treatment group as measured by clocktime. That means i need to somehow group all values for members in each agegroup for both treatment and control conditions and take an average of their clocktimes. Then following that i need to take the difference of the average clocktimes for the subgroups and compare all of these against one another.
Where i am stuck is with filtering the dataframe based on multiple columns simulatneously. I tried using groupby() as follows:
df.groupby(['ageClass','treatment'])['clockTime'].mean()
However I was not able to then calculate the difference in the mean times from the resulting series.
How should I move forward?
You can pivot the table with means you produced
df2 = df.groupby(['ageClass','treatment'])[['clockTime']].mean().reset_index().pivot(columns=['ageClass'], values='clockTime', index='treatment')
ageClass 20 30
treatment
Control 7862.500000 7826.0
Treatment 7736.333333 8163.0
Then it's easy to find a difference
df2['diff'] = df2[20] - df2[30]
treatment
Control 36.500000
Treatment -426.666667
Name: diff, dtype: float64
From the groupby
you've already done, you can groupby
index level 0, i.e. 'ageClass'
and then use diff
to find the difference between the averages of treatment and control groups for each 'ageClass'
. Since diff
subtracts the second from the first (and "Control" and "Treatment" are sorted alphabetically), add "-Control"
to make it a bit clearer.
s = df.groupby(['ageClass','treatment'])['clockTime'].mean()
out = s.groupby(level=0).diff().dropna().reset_index()
out = out.assign(treatment=out['treatment']+'-Control')
Output:
ageClass treatment clockTime
0 20 Treatment-Control -126.166667
1 30 Treatment-Control 337.000000
From your problem description, I would prescribe ranking. Differences between groups wont tell who benefited the most
s=df.groupby(['ageClass','treatment'])['clockTime'].agg('mean').reset_index()
s['rank']=s.groupby('ageClass')['clockTime'].rank()
ageClass treatment clockTime rank
0 20 Control 7862.500000 2.0
1 20 Treatment 7736.333333 1.0
2 30 Control 7826.000000 1.0
3 30 Treatment 8163.000000 2.0