How do (*SKIP) or (*F) work on regex?
I'm learning an advanced usage of regex and noticed that many posts use (*SKIP)
or (*F)
in it.
I posted a question where the idea was to match lines that don't have yellow
but has blue
only if brown
exists after blue. And the right answer was:
.*yellow.*(*SKIP)(*F)|^.*\bblue\b(?=.*brown).*$
I also have tried lookaround expressions like below but haven't worked for all the cases:
^((?!yellow).)*blue(?=.*brown).*$
I had no idea about these (*SKIP)(*F)
flags, so the question is, how do these flag works? What do they do? And are there other flags like these?
Thanks.
Solution 1:
These two backtracking control verbs are implemented only in Perl, PCRE and the pypi regex module.
The idea of the (*SKIP)(*FAIL)
trick is to consume characters that you want to avoid, and that must not be a part of the match result.
A classical pattern that uses of this trick looks like that:
What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match
A regex engine processes a string like that:
the first token of the pattern is tested on each character from left to right (by default most of the time, but some regex engines can be set to work from right to left, .net can do this if I remember well)
if the first token matches, then the regex engine tests the next token of the pattern with the next characters (after the first token match) etc.
when a token fails, the regex engine gets the characters matched by the last token back and tries another way to make the pattern succeed (if it doesn't work too, the regex engine do the same with the previous token etc.)
When the regex engine meets the (*SKIP)
verb (in this case all previous tokens have obviously succeeded), it has no right anymore to go back to all the previous tokens on the left and has no right anymore to retry all the matched characters with another branch of the pattern or at the next position in the string until the last matched character (included) if the pattern fails later on the right of the (*SKIP)
verb.
The role of (*FAIL)
is to force the pattern to fail. Thus all the characters matched on the left of (*SKIP)
are skipped and the regex engine continues its job after these characters.
The only possibility for the pattern to succeed in the example pattern is that the first branch fails before (*SKIP)
to allow the second branch to be tested.
You can find another kind of explanation here.
About Java and other regex engines that don't have these two features
Backtracking control verbs are not implemented in other regex engines and there are no equivalent.
However, you can use several ways to do the same (to be more clear, to avoid something that can be possibly matched by an other part of the pattern).
The use of capture groups:
way 1:
What_I_want_to_avoid|(What_I_want_to_match)
You only need to extract the capture group 1 (or to test if it exists), since it is what you are looking for. If you use the pattern to perform a replacement, you can use the properties of the match result (offset, length, capture group) to make the replacement with classical string functions. Other language like javascript, ruby... allows to use a callback function as replacement.
way 2:
((?>To_avoid|Other_things_that_can_be_before_what_i_want)*)(What_I_want)
It's the more easy way for the replacement, no need to callback function, the replacement string need only to begin with \1
(or $1
)
The use of lookarounds:
example, you want to find a word that is not embedded between two other words (lets say S_word
and E_word
that are different (see Qtax comment)):
(the edge cases S_word E_word word E_word
and S_word word S_word E_word
are allowed in this example.)
The backtracking control verb way will be:
S_word not_S_word_or_E_word E_word(*SKIP)(*F)|word
To use this way the regex engine needs to allow variable length lookbehinds to a certain extent. With .net or the new regex module, no problems, lookbehinds can have a totally variable length. It is possible with Java too but the size must be limited (example: (?<=.{1,1000})
).
The Java equivalent will be:
word(?:(?!not_S_word_or_E_word E_word)|(?<!S_word not_E_word{0,1000} word))
Note that in some cases, only the lookahead is necessary. Note too that starting a pattern with literal character is more efficient than starting with a lookbehind, that's why I putted it after the word (even if I need to rewrite the word one more time in the assertion.)
Solution 2:
The (*SKIP)
and (*F)
(aka *FAIL
) patterns are documented in the Perl manual: http://perldoc.perl.org/perlre.html
However, they are only available in Perl and in flavours of regex that imitate Perl (for example the PCRE library used by PHP).
Java's built in regex engine does not support these extensions, and I am not aware of one that does.
My general advice in Java is to keep your regular expressions simple, and use other string manipulation methods to achieve what can't be done clearly with a short regex.