Unroll Loop, when to use
I'm trying to understand unroll loops in regex. What is the big difference between:
MINISTÉRIO[\s\S]*?PÁG
and
MINISTÉRIO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PÁG
In this context:
http://regexr.com/3dmlr
Why should i use the second, if the first do the SAME thing?
Thanks.
What is Unroll-the-loop
See this Unroll the loop technique source:
This optimisation thechnique is used to optimize repeated alternation of the form
(expr1|expr2|...)*
. These expression are not uncommon, and the use of another repetition inside an alternation may also leads to super-linear match. Super-linear match arise from the underterministic expression(a*)*
.The unrolling the loop technique is based on the hypothesis that in most case, you kown in a repeteated alternation, which case should be the most usual and which one is exceptional. We will called the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
So, this is an optimization technique where alternations are turned into linearly matching atoms.
This makes these unrolled patterns very efficient since they involve less backtracking.
Current Scenario
Your MINISTÉRIO[\s\S]*?PÁG
is a non-unrolled pattern while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG
is. See the demos (both saved with PCRE option to show the number of steps in the box above. Regex performance is different across regex engines, but this will tell you exactly the performance difference). Add more text after text
: the first regex will start requiring more steps to finish, the second one will only show more steps after adding P
. So, in texts where the character you used in the known part is not common, unrolled patterns are very efficient.
See the Difference between .*?
, .*
and [^"]*+
quantifiers section in my answer to understand how lazy matching works (your [\s\S]*?
is the same as .*?
with a DOTALL modifier in languages that allow a .
to match a newline, too).
Performance Question
Is the lazy matching pattern always slow and inefficient? It is not always so. With very short strings, lazy dot matching is usually better (1-10 symbols). When we talk about long inputs, where there can be the leading delimiter, and no trailing one, this may lead to excessive backtracking leading to time out issues.
Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.
Use lazy matching when your input is controlled, you know there will always be a match, some known set log formats, or the like.
Bonus: Commonly Unrolled patterns
Tempered greedy tokens
Regular string literals (
"String\u0020:\"text\""
):"[^"\\]*(?:\\.[^"\\]*)*"
Multiline comment regex (
/* Comments */
):/\*[^*]*\*+(?:[^/*][^*]*\*+)*/
@<...>@
comment regex:@<[^>]*(?:>[^@]*)*@