How to impute nan values in a Pandas dataframe from a multi-index dataframe?
I have the following dataframe:
df = pd.DataFrame([[np.nan, 2, 20, 4],
[3, 1, np.nan, 1],
[3, 1, 15, 1],
[np.nan, 1, np.nan, 1],
[10, 1, 30, 4],
[50, 2, 35, 4],
[10, 1, 37, 4],
[40, 2, 30, 1]],
columns=list("ABCD"))
I want to fill the Nan values with their group means. Towards that purpose, I run the following:
df_mean=df.groupby(["B","D"]).mean()
df_mean
A C
B D
1 1 3.0 15.0
4 10.0 33.5
2 1 40.0 30.0
4 50.0 27.5
Is there a way to fill the dataframe df with the values computed in df_mean?
One way to do this would be as in this answer
df[["A", "C"]] = (
df
# create groups
.groupby(["B", "D"])
# transform the groups by filling na values with the group mean
.transform(lambda x: x.fillna(x.mean()))
)
However, for a few millions of rows, where the simple groupby([...]).mean() would take a few seconds, take too long...
It there a quicker way to solve this?
Use GroupBy.transform
by mean
and pass to DataFrame.fillna
:
df = df.fillna(df.groupby(["B", "D"]).transform('mean'))
print (df)
A B C D
0 50.0 2 20.0 4
1 3.0 1 15.0 1
2 3.0 1 15.0 1
3 3.0 1 15.0 1
4 10.0 1 30.0 4
5 50.0 2 35.0 4
6 10.0 1 37.0 4
7 40.0 2 30.0 1
Your solution with aggregation is possible also use this way:
df = df.fillna(df[['B','D']].join(df.groupby(["B","D"]).mean(), on=['B','D']))
print (df)
A B C D
0 50.0 2 20.0 4
1 3.0 1 15.0 1
2 3.0 1 15.0 1
3 3.0 1 15.0 1
4 10.0 1 30.0 4
5 50.0 2 35.0 4
6 10.0 1 37.0 4
7 40.0 2 30.0 1
You can use combine_first
:
out = df.combine_first(df.groupby(['B', 'D']).transform('mean'))
print(out)
# Output
A B C D
0 50.0 2 20.0 4
1 3.0 1 15.0 1
2 3.0 1 15.0 1
3 3.0 1 15.0 1
4 10.0 1 30.0 4
5 50.0 2 35.0 4
6 10.0 1 37.0 4
7 40.0 2 30.0 1