Filling NA based on match from other column [duplicate]
Let's suppose that I have the following dataset:
Time Geography Sex Population
1990 Northern Ireland Male NA
1990 Northern Ireland Female NA
1990 Northern Ireland Total NA
1991 Northern Ireland Male NA
1991 Northern Ireland Female NA
1991 Northern Ireland Total NA
1992 Northern Ireland Male 792100
1992 Northern Ireland Female 831100
1992 Northern Ireland Total 1623300
1993 Northern Ireland Male 812100
1993 Northern Ireland Female 851100
1993 Northern Ireland Total 1663200
and finally I want to have the following:
Time Geography Sex Population
1990 Northern Ireland Male 792100
1990 Northern Ireland Female 831100
1990 Northern Ireland Total 1623300
1991 Northern Ireland Male 792100
1991 Northern Ireland Female 831100
1991 Northern Ireland Total 1623300
1992 Northern Ireland Male 792100
1992 Northern Ireland Female 831100
1992 Northern Ireland Total 1623300
1993 Northern Ireland Male 812100
1993 Northern Ireland Female 851100
1993 Northern Ireland Total 1663200
In other words, for each Sex I want to fill in the missing values of the earlier years with the values from the first year that has no NAs.
How do I do this?
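You can chain pandas.DataFrame.sort_values, pandas.DataFrame.bfill (which replaces the deprecated fillna(method='bfill')) and pandas.DataFrame.sort_index to get your original index back in order. Sorting by Time as well as Sex keeps the years in order within each Sex, since the default sort is not stable:
df = df.sort_values(['Sex', 'Time']).bfill().sort_index()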
print(df)
Time Geography Sex Population
0 1990 Northern Ireland Male 792100.0
1 1990 Northern Ireland Female 831100.0
2 1990 Northern Ireland Total 1623300.0
3 1991 Northern Ireland Male 792100.0
4 1991 Northern Ireland Female 831100.0
5 1991 Northern Ireland Total 1623300.0
6 1992 Northern Ireland Male 792100.0
7 1992 Northern Ireland Female 831100.0
8 1992 Northern Ireland Total 1623300.0
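For anyone who wants to reproduce this, here is a minimal sketch of the sort-then-back-fill idea. The frame is rebuilt by hand from the table in the question, and the variable name filled is only illustrative. It assumes a single Geography and that the NAs sit at the start of each Sex block.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "Time": [1990] * 3 + [1991] * 3 + [1992] * 3 + [1993] * 3,
    "Geography": ["Northern Ireland"] * 12,
    "Sex": ["Male", "Female", "Total"] * 4,
    "Population": [np.nan] * 6
    + [792100, 831100, 1623300, 812100, 851100, 1663200],
})

# Sort so that each Sex forms one contiguous block ordered by Time,
# back-fill down the rows, then restore the original row order.
filled = df.sort_values(["Sex", "Time"]).bfill().sort_index()

print(filled)
```

One caveat: bfill here is not group-aware. If the last year of one Sex block were missing, it would be filled from the first row of the next block, so this approach is probably best kept for data shaped like the sample.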
You can try this:
df.set_index(['Time','Geography','Sex']).unstack().bfill().stack().reset_index()
Output:
Time Geography Sex Population
0 1990 Northern Ireland Female 831100.0
1 1990 Northern Ireland Male 792100.0
2 1990 Northern Ireland Total 1623300.0
3 1991 Northern Ireland Female 831100.0
4 1991 Northern Ireland Male 792100.0
5 1991 Northern Ireland Total 1623300.0
6 1992 Northern Ireland Female 831100.0
7 1992 Northern Ireland Male 792100.0
8 1992 Northern Ireland Total 1623300.0
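To see why the reshape works, it may help to look at the intermediate steps. The sketch below splits the one-liner into stages on a frame rebuilt by hand from the question. The names wide, wide_filled and long_df are illustrative, and selecting the Population column before unstacking is a small variation of mine that keeps the column labels flat.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "Time": [1990] * 3 + [1991] * 3 + [1992] * 3 + [1993] * 3,
    "Geography": ["Northern Ireland"] * 12,
    "Sex": ["Male", "Female", "Total"] * 4,
    "Population": [np.nan] * 6
    + [792100, 831100, 1623300, 812100, 851100, 1663200],
})

# Wide layout: one row per (Time, Geography), one column per Sex.
wide = df.set_index(["Time", "Geography", "Sex"])["Population"].unstack("Sex")

# Back-fill each Sex column independently, from later years to earlier ones.
wide_filled = wide.bfill()

# Back to the long layout; Sex comes out in alphabetical order.
long_df = wide_filled.stack().rename("Population").reset_index()

print(wide)
print(long_df)
```

Because every Sex gets its own column, the fill cannot mix Male, Female and Total values. With several Geography values the rows of the wide table would interleave regions, so a plain bfill could borrow from another region; in that case a groupby-based fill is likely the safer choice.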
I would use groupby with ffill and bfill (chaining both is just a safeguard, so that gaps at either end of a group get filled):
df['Population'] = df.groupby(['Geography', 'Sex'])['Population'].transform(lambda x: x.ffill().bfill())
df
Time Geography Sex Population
0 1990 NorthernIreland Male 792100.0
1 1990 NorthernIreland Female 831100.0
2 1990 NorthernIreland Total 1623300.0
3 1991 NorthernIreland Male 792100.0
4 1991 NorthernIreland Female 831100.0
5 1991 NorthernIreland Total 1623300.0
6 1992 NorthernIreland Male 792100.0
7 1992 NorthernIreland Female 831100.0
8 1992 NorthernIreland Total 1623300.0
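If only the back-fill from the question is wanted, the groupby object's own bfill method should be enough. This is a minimal sketch on a frame rebuilt by hand from the question; it assumes the rows are already in Time order within each group, and the name filled is illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "Time": [1990] * 3 + [1991] * 3 + [1992] * 3 + [1993] * 3,
    "Geography": ["Northern Ireland"] * 12,
    "Sex": ["Male", "Female", "Total"] * 4,
    "Population": [np.nan] * 6
    + [792100, 831100, 1623300, 812100, 851100, 1663200],
})

filled = df.copy()

# Back-fill within each (Geography, Sex) group; the result keeps the
# original index, so it can be assigned straight back to the column.
filled["Population"] = filled.groupby(["Geography", "Sex"])["Population"].bfill()

print(filled)
```

Unlike a plain bfill on the whole column, this never takes a value from a different Geography or Sex, and a group with no later value simply stays NaN.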