Tempered Greedy Token - What is different about placing the dot before the negative lookahead?
<table((?!</table>).)*</table>
matches all my table tags. However,
<table(.(?!</table>))*</table>
does not. The second one seems to make sense if I try to write out the expression in words, but I can't make sense of the first.
What is the difference ?
For reference, I got the term "Tempered Greedy Token" from here: Tempered Greedy Token Solution
Solution 1:
What is a Tempered Greedy Token?
The rexegg.com tempered greedy token reference is quite concise:
In
(?:(?!{END}).)*
, the*
quantifier applies to a dot, but it is now a tempered dot. The negative lookahead(?!{END})
asserts that what follows the current position is not the string{END}
. Therefore, the dot can never match the opening brace of{END}
, guaranteeing that we won't jump over the{END}
delimiter.
That is it: a tempered greedy token is a kind of a negated character class for a character sequence (cf. negated character class for a single character).
NOTE: The difference between a tempered greedy token and a negated character class is that the former does not really match the text other than the sequence itself, but a single character that does not start that sequence. I.e. (?:(?!abc|xyz).)+
won't match def
in defabc
, but will match def
and bc
, because a
starts the forbidden abc
sequence, and bc
does not.
It consists of:
-
(?:...)*
- a quantified non-capturing group (it may be a capturing group, but it makes no sense to capture each individual character) (a*
can be+
, it depends on whether an empty string match is expected) -
(?!...)
- a negative lookahead that actually imposes a restriction on the value to the right of the current location -
.
- (or any (usually single) character) a consuming pattern.
However, we can always further temper the token by using alternations in the negative lookahead (e.g. (?!{(?:END|START|MID)})
) or by replacing the all-matching dot with a negated character class (e.g. (?:(?!START|END|MID)[^<>])
when trying to match text only inside tags).
Consuming part placement
Note there is no mentioning of a construction where a consuming part (the dot in the original tempered greedy token) is placed before the lookahead. Avinash's answer is explaining that part clearly: (.(?!</table>))*
first matches any character (but a newline without a DOTALL modifier) and then checks if it is not followed with </table>
resulting in a failure to match e
in <table>table</table>
. *The consuming part (the .
) must be placed after the tempering lookahead.
When should we use a tempered greedy token?
Rexegg.com gives an idea:
- When we want to match a block of text between Delimiter 1 and Delimiter 2 with no Substring 3 in-between (e.g.
{START}(?:(?!{(?:MID|RESTART)}).)*?{END}
- When we want to match a block of text containing a specific pattern inside without overflowing subsequent blocks (e.g. instead of lazy dot matching as in
<table>.*?chair.*?</table>
, we'd use something like<table>(?:(?!chair|</?table>).)*chair(?:(?!<table>).)*</table>
). - When we want to match the shortest window possible between 2 strings. Lazy matching won't help when you need to get
abc 2 xyz
fromabc 1 abc 2 xyz
(seeabc.*?xyz
andabc(?:(?!abc).)*?xyz
).
Performance Issue
Tempered greedy token is resource-consuming as a lookahead check is performed after each character matched with the consuming pattern. Unrolling the loop technique can significantly increase tempered greedy token performance.
Say, we want to match abc 2 xyz
in abc 1 abc 2 xyz 3 xyz. Instead of checking each character between abc
and xyz
with abc(?:(?!abc|xyz).)*xyz
, we can skip all characters that are not a
or x
with [^ax]*
, and then match all a
that are not followed with bc
(with a(?!bc)
) and all x
that are not followed with yz
(with x(?!yz)
): abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz
.
Solution 2:
((?!</table>).)*
would check for that particular character going to be matched must not be a starting character in the string </table>
. If yes, then it only matches that particular character. *
repeats the same zero or more times.
(.(?!</table>))*
matches any character only if it's not followed by </table>
, zero or more times. So this would match all the characters inside the table tag except the last character, since the last character is followed by </table>
. And the following pattern </table>
asserts that there must be a closing table tag at the end of the match. This makes the match to fail.
See here.