Matching balanced parenthesis in Ruby using recursive regular expressions like perl
I have been looking for a way to match balanced parenthesis in a regex and found a way in Perl, that uses a recursive regular expression:
my $re;
$re = qr{
\(
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(??{ $re }) # Group with matching parens
)*
\)
}x;
from the perl regular expression site .
Is there a way to do this in Ruby or a similar language?
UPDATE:
For those interested here are some interesting links:
Oniguruma manual - from Sawa's answer.
Pragmatic Programmers' Ruby 1.9 Regular Expressions Sample Chapter
Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...)
or (?'name'...)
. Then you call a subregex with \g<name>
or \g'name'
within the same regex. So your regex translated to oniguruma regex will be:
re = %r{
(?<re>
\(
(?:
(?> [^()]+ )
|
\g<re>
)*
\)
)
}x
Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.
The manual for oniguruma is here.
I like the above solution but frequently one wishes to ignore escaped characters. Assuming that \ escapes the following character the following regex handles escaped characters as well.
ESC= /(?<![\\])(?>[\\](?:[\\][\\])*)/
UNESC= /(?:\A|(?<=[^\\]))(?:[\\][\\])*/
BALANCED_PARENS = /#{UNESC}(
(?<bal>\(
(?>
(?> (?:#{ESC}\(|#{ESC}\)|[^()])+ )
|\g<bal>
)*
\)) ) /xm
Given the limitations of negative lookbehind the part delimited by matching parens will be the first capture not the whole match (the whole match may contain leading escaped backslashes).
The reason for the complexity of ESC and UNESC is the assumption that a \\ is an escaped backslash. We only use the UNESC sequence before the initial paren match since any other escaped parenthesis will be matched inside the atomic group and never backtracked. Indeed, if we tried to use the UNESC prefix for either an interior or final paren match it would fail when [^()] inside the atomic group matched the leading \'s and refused to backtrack.
This regex will scan for the first paren that delimits a validly balanced parenthetical. Thus, given the string " ( ( stuff )" it will match "( stuff )". Frequently, the desired behavior is to locate the first (unescaped) parenthesis and either match the interior (if balanced) or fail to match. Unfortunately, atomic grouping won't stop the entire regex from being backed out of and a match attempted at a later point so we must anchor at the start of the string and only look at the 1st capture. The following regex makes this change:
BALANCED_PARENS = /\A(?:#{ESC}\(|#{ESC}\)|[^()])*+
(?<match>\(
(?<bal>
(?>
(?> (?:#{ESC}\(|#{ESC}\)|[^()])+ )
|\(\g<bal>
)*
\)) ) /xm