Fill in series to mirror a column if another column contains missing values in Python
I have a dataset where I would like to fill in a series to mirror a column if another column contains missing values in Python
Data
index1 id stat id_A id_B
1 aa y aa_Q1.22_1 aa_Q1.22_1
1 aa y aa_Q2.22_2 aa_Q2.22_2
1 bb n bb_Q3.22_3 bb_Q3.22_3
12 bb n bb_Q4.22_4 bb_Q4.22_4
2 cc_Q3.22_4
2 cc_Q4.22_5
2 dd_Q3.22_5
2 ee_Q4.22_6
2 ee_Q3.22_6
2 ee_Q4.22_7
Desired
index1 id stat id_A id_B
1 aa y aa_Q1.22_1 aa_Q1.22_1
1 aa y aa_Q2.22_2 aa_Q2.22_2
1 bb n bb_Q3.22_3 bb_Q3.22_3
12 bb n bb_Q4.22_4 bb_Q4.22_4
2 cc_Q3.22_4 cc_Q3.22_4
2 cc_Q4.22_5 cc_Q4.22_5
2 dd_Q3.22_5 dd_Q3.22_5
2 ee_Q4.22_6 ee_Q4.22_6
2 ee_Q3.22_6 ee_Q3.22_6
2 ee_Q4.22_7 ee_Q4.22_7
I would like to fill in column 'id_B' with the exact values from 'id_A' if 'id_B' is blank.
Doing
I am thinking I should you the .apply method-
df['id_A'] = df.apply(
lambda row: row[''].row[''] if np.isnan,
axis=1
)
Any suggestion is appreciated. Still researching.
Another option is to use np.where
. Simply choose from "id_A" or "id_B" depending on if df['id_B']== ''
is True or False:
df['id_B'] = np.where(df['id_B']== '', df['id_A'], df['id_B'])
Output:
index1 id stat id_A id_B
0 1 aa y aa_Q1.22_1 aa_Q1.22_1
1 1 aa y aa_Q2.22_2 aa_Q2.22_2
2 1 bb n bb_Q3.22_3 bb_Q3.22_3
3 12 bb n bb_Q4.22_4 bb_Q4.22_4
4 2 cc_Q3.22_4 cc_Q3.22_4
5 2 cc_Q4.22_5 cc_Q4.22_5
6 2 dd_Q3.22_5 dd_Q3.22_5
7 2 ee_Q4.22_6 ee_Q4.22_6
8 2 ee_Q3.22_6 ee_Q3.22_6
9 2 ee_Q4.22_7 ee_Q4.22_7
An alternative is to use mask
and combine_first
:
df['id_B'] = df['id_B'].mask(df['id_B'] == '').combine_first(df['id_A'])
print(df)
# Output
index1 id stat id_A id_B
0 1 aa y aa_Q1.22_1 aa_Q1.22_1
1 1 aa y aa_Q2.22_2 aa_Q2.22_2
2 1 bb n bb_Q3.22_3 bb_Q3.22_3
3 12 bb n bb_Q4.22_4 bb_Q4.22_4
4 2 cc_Q3.22_4 cc_Q3.22_4
5 2 cc_Q4.22_5 cc_Q4.22_5
6 2 dd_Q3.22_5 dd_Q3.22_5
7 2 ee_Q4.22_6 ee_Q4.22_6
8 2 ee_Q3.22_6 ee_Q3.22_6
9 2 ee_Q4.22_7 ee_Q4.22_7
Setup:
data = {'index1': [1, 1, 1, 12, 2, 2, 2, 2, 2, 2],
'id': ['aa', 'aa', 'bb', 'bb', '', '', '', '', '', ''],
'stat': ['y', 'y', 'n', 'n', '', '', '', '', '', ''],
'id_A': ['aa_Q1.22_1', 'aa_Q2.22_2', 'bb_Q3.22_3', 'bb_Q4.22_4',
'cc_Q3.22_4', 'cc_Q4.22_5', 'dd_Q3.22_5', 'ee_Q4.22_6',
'ee_Q3.22_6', 'ee_Q4.22_7'],
'id_B': ['aa_Q1.22_1', 'aa_Q2.22_2', 'bb_Q3.22_3', 'bb_Q4.22_4',
'', '', '', '', '', '']}
df = pd.DataFrame(df)
print(df)
# Output
index1 id stat id_A id_B
0 1 aa y aa_Q1.22_1 aa_Q1.22_1
1 1 aa y aa_Q2.22_2 aa_Q2.22_2
2 1 bb n bb_Q3.22_3 bb_Q3.22_3
3 12 bb n bb_Q4.22_4 bb_Q4.22_4
4 2 cc_Q3.22_4
5 2 cc_Q4.22_5
6 2 dd_Q3.22_5
7 2 ee_Q4.22_6
8 2 ee_Q3.22_6
9 2 ee_Q4.22_7
Replace empty strings to missing values and then use Series.fillna
:
df['id_B'] = df['id_B'].replace('',np.nan).fillna(df['id_A'])
Or use DataFrame.loc
:
df.loc[df['id_B'].eq(''), 'id_B'] = df['id_A']
If empty strings are in real data missing values or None
like Nonetype
:
df['id_B'] = df['id_B'].fillna(df['id_A'])
Or:
df.loc[df['id_B'].isna(), 'id_B'] = df['id_A']