What are regular expression Balancing Groups?
Solution 1:
As far as I know, balancing groups are unique to .NET's regex flavor.
Aside: Repeated Groups
First, you need to know that .NET is (again, as far as I know) the only regex flavor that lets you access multiple captures of a single capturing group (not in backreferences but after the match has completed).
To illustrate this with an example, consider the pattern
(.)+
and the string "abcd"
.
in all other regex flavors, capturing group 1
will simply yield one result: d
(note, the full match will of course be abcd
as expected). This is because every new use of the capturing group overwrites the previous capture.
.NET on the other hand remembers them all. And it does so in a stack. After matching the above regex like
Match m = new Regex(@"(.)+").Match("abcd");
you will find that
m.Groups[1].Captures
Is a CaptureCollection
whose elements correspond to the four captures
0: "a"
1: "b"
2: "c"
3: "d"
where the number is the index into the CaptureCollection
. So basically every time the group is used again, a new capture is pushed onto the stack.
It gets more interesting if we are using named capturing groups. Because .NET allows repeated use of the same name we could write a regex like
(?<word>\w+)\W+(?<word>\w+)
to capture two words into the same group. Again, every time a group with a certain name is encountered, a capture is pushed onto its stack. So applying this regex to the input "foo bar"
and inspecting
m.Groups["word"].Captures
we find two captures
0: "foo"
1: "bar"
This allows us to even push things onto a single stack from different parts of the expression. But still, this is just .NET's feature of being able to track multiple captures which are listed in this CaptureCollection
. But I said, this collection is a stack. So can we pop things from it?
Enter: Balancing Groups
It turns out we can. If we use a group like (?<-word>...)
, then the last capture is popped from the stack word
if the subexpression ...
matches. So if we change our previous expression to
(?<word>\w+)\W+(?<-word>\w+)
Then the second group will pop the first group's capture, and we will receive an empty CaptureCollection
in the end. Of course, this example is pretty useless.
But there's one more detail to the minus-syntax: if the stack is already empty, the group fails (regardless of its subpattern). We can leverage this behavior to count nesting levels - and this is where the name balancing group comes from (and where it gets interesting). Say we want to match strings that are correctly parenthesized. We push each opening parenthesis on the stack, and pop one capture for each closing parenthesis. If we encounter one closing parenthesis too many, it will try to pop an empty stack and cause the pattern to fail:
^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$
So we have three alternatives in a repetition. The first alternative consumes everything that is not a parenthesis. The second alternative matches (
s while pushing them onto the stack. The third alternative matches )
s while popping elements from the stack (if possible!).
Note: Just to clarify, we're only checking that there are no unmatched parentheses! This means that string containing no parentheses at all will match, because they are still syntactically valid (in some syntax where you need your parentheses to match). If you want to ensure at least one set of parentheses, simply add a lookahead (?=.*[(])
right after the ^
.
This pattern is not perfect (or entirely correct) though.
Finale: Conditional Patterns
There is one more catch: this does not ensure that the stack is empty at the end of the string (hence (foo(bar)
would be valid). .NET (and many other flavors) have one more construct that helps us out here: conditional patterns. The general syntax is
(?(condition)truePattern|falsePattern)
where the falsePattern
is optional - if it is omitted the false-case will always match. The condition can either be a pattern, or the name of a capturing group. I'll focus on the latter case here. If it's the name of a capturing group, then truePattern
is used if and only if the capture stack for that particular group is not empty. That is, a conditional pattern like (?(name)yes|no)
reads "if name
has matched and captured something (that is still on the stack), use pattern yes
otherwise use pattern no
".
So at the end of our above pattern we could add something like (?(Open)failPattern)
which causes the entire pattern to fail, if the Open
-stack is not empty. The simplest thing to make the pattern unconditionally fail is (?!)
(an empty negative lookahead). So we have our final pattern:
^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$
Note that this conditional syntax has nothing per se to do with balancing groups but it's necessary to harness their full power.
From here, the sky is the limit. Many very sophisticated uses are possible and there are some gotchas when used in combination with other .NET-Regex features like variable-length lookbehinds (which I had to learn the hard way myself). The main question however is always: is your code still maintainable when using these features? You need to document it really well, and be sure that everyone who works on it is also aware of these features. Otherwise you might be better off, just walking the string manually character-by-character and counting nesting levels in an integer.
Addendum: What's with the (?<A-B>...)
syntax?
Credits for this part go to Kobi (see his answer below for more details).
Now with all of the above, we can validate that a string is correctly parenthesized. But it would be a lot more useful, if we could actually get (nested) captures for all those parentheses' contents. Of course, we could remember opening and closing parentheses in a separate capture stack that is not emptied, and then do some substring extraction based on their positions in a separate step.
But .NET provides one more convenience feature here: if we use (?<A-B>subPattern)
, not only is a capture popped from stack B
, but also everything between that popped capture of B
and this current group is pushed onto stack A
. So if we use a group like this for the closing parentheses, while popping nesting levels from our stack, we can also push the pair's content onto another stack:
^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$
Kobi provided this Live-Demo in his answer
So taking all of these things together we can:
- Remember arbitrarily many captures
- Validate nested structures
- Capture each nesting level
All in a single regular expression. If that's not exciting... ;)
Some resources that I found helpful when I first learned about them:
- http://blog.stevenlevithan.com/archives/balancing-groups
- MSDN on balancing groups
- MSDN on conditional patterns
- http://kobikobi.wordpress.com/tag/balancing-group/ (slightly academic, but has some interesting applications)
Solution 2:
Just a small addition to M. Buettner's excellent answer:
What's the deal with the (?<A-B>)
syntax?
(?<A-B>x)
is subtly different from (?<-A>(?<B>x))
. They result in the same control flow*, but they capture differently.
For example, let's look at a pattern for balanced braces:
(?:[^{}]|(?<B>{)|(?<-B>}))+(?(B)(?!))
At the end of the match we do have a balanced string, but that is all we have - we don't know where the braces are because the B
stack is empty. The hard work the engine did for us is gone.
(example on Regex Storm)
(?<A-B>x)
is the solution for that problem. How? It doesn't capture x
into $A
: it captures the content between the previous capture of B
and the current position.
Let's use it in our pattern:
(?:[^{}]|(?<Open>{)|(?<Content-Open>}))+(?(Open)(?!))
This would capture into $Content
the strings between the braces (and their positions), for each pair along the way.
For the string {1 2 {3} {4 5 {6}} 7}
there'd be four captures: 3
, 6
,4 5 {6}
, and 1 2 {3} {4 5 {6}} 7
- much better than nothing or }
}
}
}
.
(example - click the table
tab and look at ${Content}
, captures)
In fact, it can be used without balancing at all: (?<A>).(.(?<Content-A>).)
captures the first two characters, even though they are separated by groups.
(a lookahead is more commonly used here but it doesn't always scale: it may duplicate your logic.)
(?<A-B>)
is a strong feature - it gives you exact control over your captures. Keep that in mind when you're trying to get more out of your pattern.