Combine Pandas data frame column values into new column
I'm working with Pandas and I have a data frame where we can have one of three values populated:
ID_1 ID_2 ID_3
abc NaN NaN
NaN def NaN
NaN NaN ghi
NaN NaN jkl
NaN mno NaN
pqr NaN NaN
And my goal is to combine these three columns into a new columns in my data frame:
ID_1 ID_2 ID_3 Combined_ID
abc NaN NaN abc
NaN def NaN def
NaN NaN ghi ghi
NaN NaN jkl jkl
NaN mno NaN mno
pqr NaN NaN pqr
Ideally it would just find whatever not null value exists in columns 1 through 3, but I could also concatenate since we should only have one of the three populated for each row. Thanks.
df_note = pd.read_csv("NoteIds.csv")
df_note['Combined_ID'] = # ID_1 + ID_2 + ID_3
You can use the property that summing will concatenate the string values, so you could call fillna
and pass an empty str and the call sum
and pass param axis=1
to sum row-wise:
In [26]:
df['Combined_ID'] = df.fillna('').sum(axis=1)
df
Out[26]:
ID_1 ID_2 ID_3 Combined_ID
0 abc NaN NaN abc
1 NaN def NaN def
2 NaN NaN ghi ghi
3 NaN NaN jkl jkl
4 NaN mno NaN mno
5 pqr NaN NaN pqr
If you're only interested in those 3 columns you can just select them:
In [39]:
df['Combined_ID'] = df[['ID_1','ID_2','ID_3']].fillna('').sum(axis=1)
df
Out[39]:
ID_1 ID_2 ID_3 Combined_ID
0 abc NaN NaN abc
1 NaN def NaN def
2 NaN NaN ghi ghi
3 NaN NaN jkl jkl
4 NaN mno NaN mno
5 pqr NaN NaN pqr
Let's assume that there can be more than one non-NaN value per row. Still this should work.
In [43]: df['Combined_ID'] = df.apply(
lambda x : ''.join([e for e in x if isinstance(e, basestring)]),
axis=1)
For each row, extract string items and join them.
In [44]: df
Out[44]:
ID_1 ID_2 ID_3 Combined_ID
0 abc NaN NaN abc
1 NaN def NaN def
2 NaN NaN ghi ghi
3 NaN NaN jkl jkl
4 NaN mno NaN mno
5 pqr NaN NaN pqr
I liked @EdChum's answer and looks more readable.
Interestingly, fillna('').sum(axis=1)
method is expensive for this smaller data.
In [45]: %timeit df.fillna('').sum(axis=1)
1000 loops, best of 3: 808 µs per loop
In [46]: %timeit df.apply(lambda x : ''.join([e for e in x if isinstance(e, basestring)]), axis=1)
1000 loops, best of 3: 285 µs per loop
For, ['ID_1','ID_2','ID_3']
columns only
df[['ID_1','ID_2','ID_3']].apply(lambda_function)