Is it possible to match nested brackets with a regex without using recursion or balancing groups?
The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:
(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))
This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.
Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.
Solution 1:
Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Proof
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
Note | Component | Description |
---|---|---|
(?=\() |
Make sure '(' follows before doing any hard work. | |
(?: |
Start of group used to iterate through the string, so the following lookaheads match repeatedly. | |
Handle '(' | (?= |
This lookahead deals with finding the next '('. |
.*?\((?!.*?\1) |
Match up until the next '(' that is not followed by \1 . Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again |
|
(.*\)(?!.*\2).*) |
Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads. |
|
) |
||
Handle ')' | (?= |
This lookahead deals with finding the next ')' |
.*?\)(?!.*?\2) |
Match up until the next ')' that is not followed by \2 . Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before. |
|
(.*) |
Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient. | |
) |
||
. |
Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point. | |
)+? |
Match as few times as possible until a balanced group has been found. This is validated by the following check | |
Final validation | .*?(?=\1) |
Match up to and including the last '(' found. |
[^(]*(?=\2$) |
Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group). |
Conclusion
So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.
I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.
Solution 2:
Brief
Input Corrections
First of all, your input is incorrect as there's an extra parenthesis (as shown below)
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:
Extra parenthesis removed
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
^
Additional parenthesis added to match extra closing parenthesis
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Regex Capabilities
Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).
This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.
Input
Considering the original input is actually invalid, we'll use the following inputs to test against.
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
Testing against these inputs should yield the following results:
- INVALID (no match)
- VALID (match)
- VALID (match)
Code
There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).
See regex in use here
Using DEFINE block
(?(DEFINE)
(?<value>[^()\r\n]+)
(?<groupVal>(?&group)|(?&value))
(?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$
Note: This regex uses the flags gmx
Without DEFINE block
See regex in use here
^(?<group>
(?<value>[^()\r\n]+)*
\((?<groupVal>(?&group)|(?&value))\)
(?&groupVal)*
)$
Note: This regex uses the flags gmx
Without x modifier (one-liner)
See regex in use here
^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$
Without named (groups & references)
See regex in use here
^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$
Note: This is the shortest possible method that I could come up with.
Explanation
I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.
-
^
Assert position at the start of the line -
(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)
Capture the following into capture group 1-
([^()\r\n]+)*
Capture the following into capture group 2 any number of times-
[^()\r\n]+
Match any character not present in the set()\r\n
one or more times
-
-
\(
Match a left/opening parenthesis character(
literally -
((?1)|(?2))
Capture either of the following into capture group 3-
(?1)
Recurse the first subpattern (1) -
(?2)
Recurse the second subpattern (2)
-
-
\)
Match a right/closing parenthesis character)
literally -
(?3)*
Recurse the third subpattern (3) any number of times
-
-
$
Assert position at the end of the line