How to merge pandas on string contains?
Solution 1:
New Answer
Here is one approach based on pandas/numpy.
rhs = (df1.column_common
.apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
.bfill(axis=1)
.iloc[:, 0])
(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
.rename(columns={0: 'column_a', 1: 'column_b'}))
column_a column_b
0 John Moore
1 Michael Cohen
2 Dan Smith
3 George NaN
4 Adam Faber
Old Answer
Here's a solution for left-join behaviour, as in it doesn't keep column_a
values that do not match any column_b
values. This is slower than the above numpy/pandas solution because it uses two nested iterrows
loops to build a python list.
tups = [(a1, a2) for i, (a1, b1) in df1.iterrows()
for j, (a2, b2) in df2.iterrows()
if b1 in b2]
(pd.DataFrame(tups, columns=['column_a', 'column_b'])
.drop_duplicates('column_a')
.reset_index(drop=True))
column_a column_b
0 John Moore
1 Michael Cohen
2 Dan Smith
3 Adam Faber
Solution 2:
My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large but perhaps someone more knowledgeable than I can suggest an improvement.
def strmerge(strcolumn):
for i in df2['column_common']:
if strcolumn in i:
return df2[df2['column_common'] == i]['column_b'].values[0]
df1['column_b'] = df1['column_common'].apply(strmerge)
df1
column_a column_common column_b
0 John code Moore
1 Michael other Cohen
2 Dan ome Smith
3 George no match None
4 Adam word Faber