pandas dataframe str.contains() AND operation

I have a df (Pandas Dataframe) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

Calling df.col_name.str.contains("apple|banana") matches all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator to the str.contains() method, so that it only grabs strings that contain BOTH "apple" and "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, ..., etc.)

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]

You can also do it with a single regular expression, using one lookahead per word:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build the regex string from your list of words like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

will render:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

You can then pass the generated pattern to str.contains() to filter on any list of words.
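Putting the pieces above together, a minimal sketch might look like this. The helper name all_words_pattern is made up for illustration, and the re.escape() and na=False parts are additions that assume you want each word matched literally and missing values treated as non-matches:

```python
import re

import pandas as pd


def all_words_pattern(words):
    """Build a regex that matches strings containing every word in `words`."""
    # re.escape keeps characters such as '.' or '+' inside a word literal.
    return '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)


df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

pattern = all_words_pattern(['apple', 'banana'])

# na=False makes missing values count as "no match" instead of NaN.
result = df[df['col_name'].str.contains(pattern, na=False)]
print(result)
```

Note that this is substring matching, so 'apple' would also match inside 'pineapple'.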

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# At least one word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool

This works (regex=True is the default, so it can be omitted):

df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)

If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:

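import numpy as np

targets = ['apple', 'banana', 'strawberry']
# Build a list rather than a generator: np.vstack expects a sequence of arrays.
fruit_masks = [df['col'].str.contains(string, regex=False) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]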
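A variation on the same idea that stays in pandas is to fold the per-word masks together with `&`, which keeps the result as a boolean Series aligned to the DataFrame's index. This is only a sketch: the sample frame is rebuilt here so the snippet runs on its own, and regex=False and na=False are assumptions (literal substring matching, missing values treated as non-matches):

```python
import operator
from functools import reduce

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# One boolean Series per target word; regex=False means plain substring search.
masks = (df['col'].str.contains(t, regex=False, na=False) for t in targets)

# AND all the masks together; the result is still a Series with df's index.
combined = reduce(operator.and_, masks)

matches = df[combined]
print(matches)
```

Swapping operator.and_ for operator.or_ gives the "any word" behaviour instead.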
