How does the regular expression ‘(?<=#)[^#]+(?=#)’ work?
I have the following regex in a C# program, and have difficulties understanding it:
(?<=#)[^#]+(?=#)
I'll break it down to what I think I understood:
(?<=#) a group, matching a hash. what's `?<=`?
[^#]+ one or more non-hashes (used to achieve non-greediness)
(?=#) another group, matching a hash. what's the `?=`?
So the problem I have is the `?<=` and `?<` part. From reading MSDN, `?<name>` is used for naming groups, but in this case the angle bracket is never closed.
I couldn't find `?=` in the docs, and searching for it is really difficult, because search engines will mostly ignore those special chars.
Solution 1:
They are called lookarounds; they allow you to assert whether a pattern matches or not at the current position, without consuming any characters. There are 4 basic lookarounds:
- Positive lookarounds: see if we CAN match the pattern ...
  - `(?=pattern)` ... to the right of the current position (look ahead)
  - `(?<=pattern)` ... to the left of the current position (look behind)
- Negative lookarounds: see if we can NOT match the pattern ...
  - `(?!pattern)` ... to the right
  - `(?<!pattern)` ... to the left
As an easy reminder, for a lookaround:
- `=` is positive, `!` is negative
- `<` is look behind, otherwise it's look ahead
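To see the four forms side by side, here is a minimal sketch. It uses Python's `re` module purely for illustration (the question is about C#, where the same four constructs are written identically); the sample string `a#b` and the variable names are made up for the demo.

```python
import re

text = "a#b"  # hypothetical sample input

# (?=#)   positive lookahead:  a letter that IS followed by '#'
lookahead_pos = re.findall(r"[a-z](?=#)", text)

# (?<=#)  positive lookbehind: a letter that IS preceded by '#'
lookbehind_pos = re.findall(r"(?<=#)[a-z]", text)

# (?!#)   negative lookahead:  a letter that is NOT followed by '#'
lookahead_neg = re.findall(r"[a-z](?!#)", text)

# (?<!#)  negative lookbehind: a letter that is NOT preceded by '#'
lookbehind_neg = re.findall(r"(?<!#)[a-z]", text)

print(lookahead_pos)   # ['a']
print(lookbehind_pos)  # ['b']
print(lookahead_neg)   # ['b']
print(lookbehind_neg)  # ['a']
```

In every case the result is a single letter: the `#` is only checked, never included in the match.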
References
- regular-expressions.info/Lookarounds
But why use lookarounds?
One might argue that the lookarounds in the pattern above aren't necessary, and that `#([^#]+)#` will do the job just fine (extracting the string captured by `\1` to get the non-`#` part).
Not quite. The difference is that since a lookaround doesn't consume the `#`, it can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.
Consider the following input string:
and #one# and #two# and #three#four#
Now, `#([a-z]+)#` will give the following matches (as seen on rubular.com):
and #one# and #two# and #three#four#
    \___/     \___/     \_____/
Compare this with `(?<=#)[a-z]+(?=#)`, which matches:
and #one# and #two# and #three#four#
     \_/       \_/       \___/ \__/
Unfortunately this can't be demonstrated on rubular.com, since it doesn't support lookbehind. However, it does support lookahead, so we can do something similar with `#([a-z]+)(?=#)`, which matches (as seen on rubular.com):
and #one# and #two# and #three#four#
    \__/      \__/      \____/\___/
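The three results above can be reproduced with a short script. This is only a sketch using Python's `re` module (an assumption for illustration, not the C# from the question); the variable names are hypothetical.

```python
import re

text = "and #one# and #two# and #three#four#"

# Both '#' are consumed: the '#' that closes "three" is used up,
# so it cannot also open "four".
consuming = [m.group(0) for m in re.finditer(r"#([a-z]+)#", text)]

# Lookarounds on both sides: no '#' is consumed, so "four" is found too.
lookaround = re.findall(r"(?<=#)[a-z]+(?=#)", text)

# Opening '#' consumed, closing '#' only looked at.
mixed = [m.group(0) for m in re.finditer(r"#([a-z]+)(?=#)", text)]

print(consuming)   # ['#one#', '#two#', '#three#']
print(lookaround)  # ['one', 'two', 'three', 'four']
print(mixed)       # ['#one', '#two', '#three', '#four']
```

Only the first pattern misses `four`, because its previous match already swallowed the `#` that `four` would need on its left.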
References
- regular-expressions.info/Flavor Comparison
Solution 2:
As another poster mentioned, these are lookarounds, special constructs for changing what gets matched and when. This says:
(?<=#)    assert, without consuming it, that a `#`
          comes immediately before the next expression
[^#]+     one or more characters that are not `#`, and
(?=#)     assert, without consuming it, that a `#`
          comes immediately after the last expression
So this will match all the characters in between two `#`s.
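As a quick check of that reading, here is a small sketch that runs the pattern from the question. It uses Python's `re` module for illustration only, and the input strings are invented for the demo.

```python
import re

pattern = re.compile(r"(?<=#)[^#]+(?=#)")

# Each match is reported with its (start, end) span; the spans
# exclude the surrounding '#' characters.
found = [(m.group(), m.span()) for m in pattern.finditer("a#foo#bar#b")]
print(found)  # [('foo', (2, 5)), ('bar', (6, 9))]

# Because no '#' is consumed, EVERY run between two hashes matches,
# including the text that sits between "#one#" and "#two#".
between = pattern.findall("#one# and #two#")
print(between)  # ['one', ' and ', 'two']
```

The leading `a` and trailing `b` in the first input are not matched, since each has a `#` on one side only.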
Lookaheads and lookbehinds are very useful in many cases. Consider, for example, the rule "match all `b`s not followed by an `a`." Your first attempt might be something like `b[^a]`, but that's not right: this will also match the `bu` in `bus` or the `bo` in `boy`, but you only wanted the `b`. And it won't match the `b` in `cab`, even though that's not followed by an `a`, because there are no more characters to match.
To do that correctly, you need a lookahead: `b(?!a)`. This says "match a `b` but don't match an `a` afterwards, and don't make that part of the match". Thus it'll match just the `b` in `bolo`, which is what you want; likewise it'll match the `b` in `cab`.
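The difference between the two attempts is easy to verify. The following is a minimal sketch in Python's `re` module (chosen for illustration; the word list is the set of examples from the text plus `bat` as an extra negative case).

```python
import re

words = ["bus", "boy", "cab", "bat", "bolo"]

# First attempt: 'b' followed by any character other than 'a'.
# It consumes that extra character and needs one to exist.
naive = {w: re.findall(r"b[^a]", w) for w in words}

# Negative lookahead: 'b' not followed by 'a'; nothing after 'b' is consumed.
lookahead = {w: re.findall(r"b(?!a)", w) for w in words}

for w in words:
    print(w, naive[w], lookahead[w])
# bus ['bu'] ['b']
# boy ['bo'] ['b']
# cab [] ['b']
# bat [] []
# bolo ['bo'] ['b']
```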
Solution 3:
They're called look-arounds: http://www.regular-expressions.info/lookaround.html