Extract sentences that contain certain words using Regex

Say I have the following string:

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

What I'm trying to do is get all the sentences (that is, the text between commas) that contain either 'car' or 'wheels'. Using regex, I did the following:

re.findall('[^,]*{}|{}[^,]*'.format('car', 'wheels'), txt)

And I get this result:

['the car', ' the car', 'wheels', 'wheels are round', ' wheels make the car']

Apparently, it only gives back what's between the words 'car' and 'wheels', and it seems like the order matters. What I'm trying to get is this:

['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']

Any ideas about how to do this?


Your regex of

re.findall('[^,]*{}|{}[^,]*'.format('car', 'wheels'), txt)

only needs a small modification: the addition of a (non-capturing) group. Without it, the | applies to the entire regex, so the pattern means either '[^,]*car' or 'wheels[^,]*', instead of alternating only between car and wheels.

Your new regex will be

re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)

This outputs:

['the car is running', ' the car has wheels', ' wheels are round', ' wheels make the car go']
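If the list of words grows, the pattern can be built from a list instead of being formatted by hand. This is a minimal sketch, assuming the words should be matched literally (hence re.escape); the names words and pattern are illustrative and not part of the question. The strip() call removes the leading space that follows each comma:

```python
import re

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

# Illustrative word list; re.escape guards against regex metacharacters in the words.
words = ['car', 'wheels']
pattern = '[^,]*(?:{})[^,]*'.format('|'.join(map(re.escape, words)))

# Strip the whitespace left over from the ", " separators.
sentences = [match.strip() for match in re.findall(pattern, txt)]
print(sentences)
# ['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
```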

However, I don't think regex is suitable for this problem. I would advise the following solution instead:

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
# Either:
sentences = [sentence.strip() for sentence in txt.split(",") if "car" in sentence or "wheels" in sentence]
# Or alternatively:
words = ["car", "wheels"]
sentences = [
    sentence.strip() # Remove spaces before and after the sentence
    for sentence in txt.split(",")
    if any(
        word in sentence
        for word in words
    )
]
# This second method allows for checking for more than just 2 words

This outputs:

['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
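Note that "car" in sentence is a substring test, so it would also accept a sentence containing 'carpet'. If only whole words should count, one possible variation is to compare against the words of each sentence. This is a sketch under the assumption that words are separated by whitespace; the sample text below is made up for illustration:

```python
# Made-up sample text: 'carpet' contains 'car' as a substring but not as a word.
txt = 'the carpet is red, the car has wheels, wheels are round'
words = {'car', 'wheels'}

sentences = [
    sentence.strip()
    for sentence in txt.split(',')
    # isdisjoint is False when the sentence shares at least one whole word with `words`
    if not words.isdisjoint(sentence.split())
]
print(sentences)
# ['the car has wheels', 'wheels are round']
```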

Performance

The performance of the two methods (list comprehension and regex) can be compared with the following script, which runs each snippet 100 times on a text with 40k sentences.

import timeit
import re

# Set up a testing text with 40k sentences.
txt = (
    "the car is running, the car has wheels, wheels are round, the road is clear, "
    * 10000
)

# The (simple) list comprehension strategy
list_comp_time = timeit.timeit(
    '[sentence for sentence in txt.split(",") if "car" in sentence or "wheels" in sentence]',
    globals=globals(),
    number=100,
)

# A strategy using regex
regex_time = timeit.timeit(
    "re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)",
    globals=globals(),
    number=100,
)

print(f"The List Comprehension method took {list_comp_time:.8f}s")
print(f"The Regex method took {regex_time:.8f}s")

The output is:

The List Comprehension method took 0.48497320s
The Regex method took 3.71355870s

In other words, the list comprehension method is faster, by a factor of more than 7 in this run.


import re

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

print(re.findall(r"(?<!,)[^,]*(?:car|wheels)[^,]*", txt))

Output:

['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
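The lookbehind (?<!,) drops the leading space only because every comma here is followed by exactly one space. With no space after a comma, a sentence that starts with one of the words can be missed entirely, and with two spaces one of them is kept. A possible alternative, offered as a sketch rather than a definitive fix, is to consume the separator and whitespace outside a capturing group, since re.findall returns only the group when the pattern has exactly one. Trailing whitespace before a comma is still kept by this version:

```python
import re

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

# The start of the string or a comma, then optional whitespace, are matched
# outside the capturing group, so findall returns only the sentence itself.
pattern = re.compile(r"(?:^|,)\s*([^,]*(?:car|wheels)[^,]*)")

result = pattern.findall(txt)
print(result)
# ['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']

# Hypothetical input with no space after the comma.
tight = pattern.findall('a car,car b')
print(tight)
# ['a car', 'car b']
```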

The same can be done without regex, although the leading space after each comma is kept unless you also call strip():

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

result = [x for x in txt.split(',') if "car" in x or "wheels" in x]

print(result)

Output:

['the car is running', ' the car has wheels', ' wheels are round', ' wheels make the car go']