Extract sentences that contain certain words using Regex
Say I have the following string:
txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
What I'm trying to do is get all sentences (that is, between commas) that have either 'car' or 'wheels' in them. Using regex, I did the following:
re.findall('[^,]*{}|{}[^,]*'.format('car', 'wheels'), txt)
And I get this result:
['the car', ' the car', 'wheels', 'wheels are round', ' wheels make the car']
Apparently, it only gives back what's between the words 'car' and 'wheels', and it seems like the order matters. What I'm trying to get is this:
['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
Any ideas about how to do this?
Your regex of
re.findall('[^,]*{}|{}[^,]*'.format('car', 'wheels'), txt)
only needs a small modification: the inclusion of a (non-capturing) group. Otherwise the | applies to the entire regex, as opposed to just car|wheels.
Your new regex will be
re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)
This outputs:
['the car is running', ' the car has wheels', ' wheels are round', ' wheels make the car go']
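If the leading spaces in those matches are unwanted, one small follow-up (my own addition, not part of the original answer) is to strip each match afterwards:

```python
import re

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

# Strip surrounding whitespace from each regex match
matches = [m.strip() for m in re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)]
print(matches)
# ['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
```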
However, I don't think regex is suitable for this problem. I would advise the following solution instead:
txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
# Either:
sentences = [sentence.strip() for sentence in txt.split(",") if "car" in sentence or "wheels" in sentence]
# Or alternatively:
words = ["car", "wheels"]
sentences = [
sentence.strip() # Remove spaces before and after the sentence
for sentence in txt.split(",")
if any(
word in sentence
for word in words
)
]
# This second method allows for checking for more than just 2 words
This outputs:
['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
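To illustrate the point about extensibility, here is the same comprehension with a hypothetical third word, 'road', added to the list (it then also picks up the 'the road is clear' sentence):

```python
txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

# 'road' is an illustrative extra word, not part of the original question
words = ["car", "wheels", "road"]
sentences = [
    sentence.strip()  # Remove spaces before and after the sentence
    for sentence in txt.split(",")
    if any(word in sentence for word in words)
]
print(sentences)
# ['the car is running', 'the car has wheels', 'wheels are round', 'the road is clear', 'wheels make the car go']
```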
Performance
The performance of the two methods (list comprehension and regex) can be compared with the following script, which runs each snippet (passed as a string) 100 times on a text with 40k sentences.
import timeit
import re
# Set up a testing text with 40k sentences.
txt = (
"the car is running, the car has wheels, wheels are round, the road is clear, "
* 10000
)
# The (simple) list comprehension strategy
list_comp_time = timeit.timeit(
'[sentence for sentence in txt.split(",") if "car" in sentence or "wheels" in sentence]',
globals=globals(),
number=100,
)
# A strategy using regex
regex_time = timeit.timeit(
"re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)",
globals=globals(),
number=100,
)
print(f"The List Comprehension method took {list_comp_time:.8f}s")
print(f"The Regex method took {regex_time:.8f}s")
The output is:
The List Comprehension method took 0.48497320s
The Regex method took 3.71355870s
In other words, the List Comprehension method is more time-efficient.
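As a side note, a somewhat fairer benchmark for the regex approach would precompile the pattern with re.compile, so the compilation and string-formatting cost is paid once rather than on every call. A sketch (timings will vary by machine, so no numbers claimed here):

```python
import re
import timeit

# Same 40k-sentence test text as above
txt = (
    "the car is running, the car has wheels, wheels are round, the road is clear, "
    * 10000
)

# Compile the pattern once, outside the timed loop
pattern = re.compile('[^,]*(?:car|wheels)[^,]*')

compiled_time = timeit.timeit(
    "pattern.findall(txt)",
    globals=globals(),
    number=100,
)
print(f"The precompiled Regex method took {compiled_time:.8f}s")
```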
import re
txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
print(re.findall(r"(?<!,)[^,]*(?:car|wheels)[^,]*", txt))
Output:
['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
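To illustrate what the (?<!,) lookbehind buys here, compare the same pattern with and without it on a short input:

```python
import re

txt = 'the car is running, the car has wheels'

# Without the lookbehind, a match can start at the space right after a comma
without_lb = re.findall(r"[^,]*(?:car|wheels)[^,]*", txt)
print(without_lb)
# ['the car is running', ' the car has wheels']

# (?<!,) forbids a match from starting immediately after a comma,
# so matching begins one character later, past the leading space
with_lb = re.findall(r"(?<!,)[^,]*(?:car|wheels)[^,]*", txt)
print(with_lb)
# ['the car is running', 'the car has wheels']
```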
This approach can also help, without any use of regex:
txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
result = [x for x in txt.split(',') if "car" in x or "wheel" in x]
print(result)
Output:
['the car is running', ' the car has wheels', ' wheels are round', ' wheels make the car go']