How to match any string from a list of strings in regular expressions in Python?
Let's say I have a list of strings,
string_lst = ['fun', 'dum', 'sun', 'gum']
I want to make a regular expression where, at some point in it, I can match any of the strings I have in that list, within a group, such as this:
import re
template = re.compile(r".*(elem for elem in string_lst).*")
template.match("I love to have fun.")
What would be the correct way to do this? Or would one have to make multiple regular expressions and match them all separately to the string?
Solution 1:
Join the list on the pipe character |, which represents different options in regex.
import re
string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."
# Join the words with | and wrap them in a capturing group inside a lookahead
print(re.findall(r"(?=(" + '|'.join(string_lst) + r"))", x))
Output: ['fun']
You cannot use match here, as it only matches at the start of the string. Using search, you will get only the first match. So use findall instead.
Also use a lookahead if you have overlapping matches that do not start at the same point.
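To illustrate the points above, here is a minimal sketch comparing match, search, and findall on the joined pattern, and showing what the lookahead changes for overlapping matches. The sample text and the overlap_words list are made up for illustration and are not part of the original answer.

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
pattern = re.compile('|'.join(string_lst))

# match() only tries at position 0, so it fails on this sentence
at_start = pattern.match("I love to have fun.")

# search() scans the string but stops at the first hit
text = "fun in the sun"  # hypothetical sample text
first = pattern.search(text).group()

# findall() returns every non-overlapping hit
all_matches = pattern.findall(text)

# Overlapping candidates (hypothetical words): 'abc' and 'bcd' overlap in 'abcd'
overlap_words = ['abc', 'bcd']
plain = re.findall('|'.join(overlap_words), "abcd")
with_lookahead = re.findall('(?=(' + '|'.join(overlap_words) + '))', "abcd")

print(at_start, first, all_matches, plain, with_lookahead)
```

Without the lookahead, the first match consumes 'abc' and the scan resumes after it, so 'bcd' is never seen. The lookahead matches zero characters, so the scan advances one position at a time and the capturing group reports both words.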
Solution 2:
The regex module has named lists (sets, actually):
#!/usr/bin/env python
import regex as re # $ pip install regex
p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
if p.search("I love to have fun."):
    print('matched')
Here words is just a name; you can use anything you like instead. The .search() method is used instead of putting .* before/after the named list.
To emulate named lists using the stdlib's re module:
#!/usr/bin/env python
import re
words = ['fun', 'dum', 'sun', 'gum']
longest_first = sorted(words, key=len, reverse=True)
p = re.compile(r'(?:{})'.format('|'.join(map(re.escape, longest_first))))
if p.search("I love to have fun."):
    print('matched')
re.escape() is used to escape regex meta-characters such as .*? inside individual words (to match the words literally). sorted() emulates the regex module's behavior by putting the longest words first among the alternatives. Compare:
>>> import re
>>> re.findall("(funny|fun)", "it is funny")
['funny']
>>> re.findall("(fun|funny)", "it is funny")
['fun']
>>> import regex
>>> regex.findall(r"\L<words>", "it is funny", words=['fun', 'funny'])
['funny']
>>> regex.findall(r"\L<words>", "it is funny", words=['funny', 'fun'])
['funny']