Recursive pattern in regex
Solution 1:
The pattern is:
{((?>[^{}]+|(?R))*)}
You can see this works for your example:
regex.findall("{((?>[^{}]+|(?R))*)}", "{1, {2, 3}} {4, 5}")
# ['1, {2, 3}', '4, 5']
Explanation:
The m part needs to exclude the brackets. The use of an atomic group is needed if you want at the same time to allow a quantifier for [^{}]
and to repeat the group without catastropic backtracking problems. To be more clear, if the last closing curly bracket is missing this regex engine will backtrack atomic group by atomic group instead of character by character. To drive home this point, you can make the quantifier possessive like that: {((?>[^{}]+|(?R))*+)}
(or {((?:[^{}]+|(?R))*+)}
since the atomic group is no more useful).
The atomic group (?>....)
and the possessive quantifier ?+
, *+
, ++
are the two sides of the same feature. This feature forbids the regex engine to backtrack inside the group of characters that becomes an "atom" (something you can't divide in smaller parts).
The basic examples are the following two patterns that always fail for the string aaaaaaaaaab
:
(?>a+)ab
a++ab
that is:
regex.match("a++ab", "aaaaaaaaaab")
regex.match("(?>a+)ab", "aaaaaaaaaab")
When you use (?:a+)
or a+
the regex engine (by default) records (in prevision) all backtracking positions for all characters. But when you use an atomic group or a possessive quantifier, theses backtracking positions are no more recorded (except for the begining of the group). So when the backtracking mechanism occurs the last "a" character can't be given back. Only the entire group can be given back.
[EDIT]: the pattern can be written in a more efficient way if you use an "unrolled" subpattern to describe the content between brackets:
{([^{}]*+(?:(?R)[^{}]*)*+)}
Solution 2:
I was able to do this no problem with the b(?:m|(?R))*e
syntax:
{((?:[^{}]|(?R))*)}
Demo
I think the key from what you were attempting is that the repetition doesn't go on m
, but the entire (?:m|(?R))
group. This is what allows the recursion with the (?R)
reference.