Find out number of capture groups in Python regular expressions
Is there a way to determine how many capture groups there are in a given regular expression?
I would like to be able to do the following:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default
    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * num_of_groups(regexp)
This allows me to do stuff like:
first, last, phone = groups(r'(\w+) (\w+) ([\d\-]+)', 'John Doe 555-3456')
However, I don't know how to implement num_of_groups. (Currently I just work around it.)
EDIT: Following the advice from rslite, I replaced re.findall with re.search.
sre_parse seems like the most robust and comprehensive solution, but requires tree traversal and appears to be a bit heavy.
MizardX's regular expression seems to cover all bases, so I'm going to go with that.
Solution 1:
import re
def num_groups(regex):
    return re.compile(regex).groups
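To connect this back to the question, here is a minimal sketch of the asker's groups helper built on top of num_groups. It assumes the behaviour described in the question's docstring (empty strings when nothing matches); the function names are taken from the question and this answer, and nothing else is added.

```python
import re

def num_groups(regex):
    # Pattern.groups is the number of capturing groups in the compiled pattern.
    return re.compile(regex).groups

def groups(regexp, s):
    """Return the groups of the first match, or empty strings if nothing matches."""
    m = re.search(regexp, s)
    if m:
        return m.groups()
    # No match: fall back to one empty string per capturing group.
    return ('',) * num_groups(regexp)

print(groups(r'(\d)(\d)(\d)', '123'))  # ('1', '2', '3')
print(groups(r'(\d)(\d)(\d)', 'abc'))  # ('', '', '')
```

If the pattern has optional groups, m.groups() yields None for the ones that did not participate in the match; passing a default, as in m.groups(''), would keep the result consistent with the no-match case.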
Solution 2:
f_x = re.search(...)
len_groups = len(f_x.groups())
Solution 3:
Something from inside sre_parse might help.
At first glance, maybe something along the lines of:
>>> import sre_parse
>>> sre_parse.parse('(\d)\d(\d)')
[('subpattern', (1, [('in', [('category', 'category_digit')])])),
('in', [('category', 'category_digit')]),
('subpattern', (2, [('in', [('category', 'category_digit')])]))]
I.e. count the items of type 'subpattern':
import sre_parse

def count_patterns(regex):
    """
    >>> count_patterns('foo: \d')
    0
    >>> count_patterns('foo: (\d)')
    1
    >>> count_patterns('foo: (\d(\s))')
    1
    """
    parsed = sre_parse.parse(regex)
    return len([token for token in parsed if token[0] == 'subpattern'])
Note that we're only counting root-level patterns here, so the last example returns only 1. To count nested groups as well, the tokens would need to be searched recursively.
Solution 4:
First of all, if you only need the first result of re.findall, it's better to just use re.search, which returns a match object or None.
For the number of groups, you could count the opening parentheses '(' except those that are escaped with '\'. You could use another regex for that:
import re
def num_of_groups(regexp):
    rg = re.compile(r'(?<!\\)\(')
    return len(rg.findall(regexp))
Note that this doesn't work if the regex contains non-capturing groups, or if '(' is escaped by writing it as '[(]'. So this is not very reliable, but depending on the regexes you use it might help.
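To make those limitations concrete, the sketch below compares the parenthesis-counting heuristic with the groups attribute of the compiled pattern. The helper name naive_count and the sample patterns are made up for illustration; the last sample (an escaped backslash followed by a real group) is an extra failure case not mentioned above.

```python
import re

def naive_count(regexp):
    # Same heuristic as num_of_groups: count every '(' not preceded by a backslash.
    return len(re.findall(r'(?<!\\)\(', regexp))

samples = [
    r'(\d)(\d)',   # plain capturing groups: both counts agree
    r'(?:a)(b)',   # non-capturing group is counted by the heuristic
    r'[(]',        # literal '(' inside a character class is counted
    r'\\(a)',      # escaped backslash hides a real group from the heuristic
]

for pattern in samples:
    # Print the heuristic count next to the count reported by the re module.
    print(pattern, naive_count(pattern), re.compile(pattern).groups)
```

Only the first sample gives the same number in both columns, which suggests the heuristic is safe only for patterns known to avoid these constructs.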