groupby function returns undesired result for pandas dataframe
Use GroupBy.agg and remove the missing values with Series.dropna:
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(set(x.dropna())))
         .reset_index())
print(df2)
  uniprot_id protein_group protein_family protein_subfamily
0     O00141           AGC            SGK
1     P35916            TK          VEGFR
2     P45985           STE           STE7
3     Q13163           STE           STE7
4     Q5VT25           AGC           DMPK               GEK
5     Q6P3W7         Other           SCY1
6     Q8TAS1         Other            KIS
7     Q96S53           TKL           LISK              TESK
8     Q96SB4          CMGC           SRPK
9     Q9UKI8         Other            TLK
If order is important, don't use sets, because their iteration order is not defined. Use the dict.fromkeys trick instead, which drops duplicates while keeping the order of first appearance:
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(dict.fromkeys(x.dropna())))
         .reset_index())
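
To try the order-preserving variant without the original data, here is a minimal self-contained sketch. The sample frame, its values, and the helper name join_unique are made up for illustration and are not taken from the question; only the column names follow the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the asker's frame): ids repeat, and some
# cells are missing.
df = pd.DataFrame({
    "uniprot_id": ["A1", "A1", "A1", "B2", "B2"],
    "protein_group": ["TK", "AGC", "TK", "STE", np.nan],
    "protein_family": [np.nan, np.nan, np.nan, "STE7", "STE7"],
})

cols = ["protein_group", "protein_family"]


def join_unique(s):
    """Drop NaN, drop duplicates keeping first-seen order, then join."""
    return "; ".join(dict.fromkeys(s.dropna()))


ordered = df.groupby("uniprot_id")[cols].agg(join_unique).reset_index()
print(ordered)
```

A group whose values are all missing ends up as an empty string, because joining an empty sequence returns ''. If NaN is preferred there, one option is to chain .replace('', np.nan) after the aggregation.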