What is the use of '\G' anchor in regex?

Solution 1:

UPDATE

\G forces the pattern to return only matches that form a continuous chain. The first match must start at the beginning of the subject (or at the offset), and each subsequent match must start exactly where the previous match ended. As soon as the chain is broken, no further matches are returned.

<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

//Will output match 5 times: the four leading ones plus the "match," inside "not-match,"
//(the final "match" has no trailing comma, so it is never matched)
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}

echo '<br />';

$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";

preg_match_all( $pattern, $subject, $matches );

//Will only output match 4 times because at not-match the chain is broken
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
?>

This is straight from the PHP docs:

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are

 \G
    first matching position in subject

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.

http://www.php.net/manual/en/regexp.reference.escape.php

You will have to scroll down that page a bit but there it is.
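To make the difference between \G and \A concrete, here is a minimal sketch that uses the offset argument of preg_match(). The subject string and the offset values are made up for illustration.

```php
<?php
$subject = 'xxfoo';

// \G is anchored to the offset, so "foo" is found at position 2.
$withG = preg_match('/\Gfoo/', $subject, $m, 0, 2);

// \A is anchored to the real start of the subject, so this fails
// even though matching starts at position 2.
$withA = preg_match('/\Afoo/', $subject, $m, 0, 2);

// \G does not let the engine scan forward: starting at offset 1,
// the match would have to begin at "x", so this fails as well.
$noScan = preg_match('/\Gfoo/', $subject, $m, 0, 1);

var_dump($withG);  // int(1)
var_dump($withA);  // int(0)
var_dump($noScan); // int(0)
```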

There is a really good example in Ruby, and it works the same way in PHP. See the question "How the Anchor \z and \G works in Ruby?"

Solution 2:

\G matches at the match boundary, which is either the beginning of the string or the position immediately after the last character consumed by the previous match.

It is particularly useful when you need to do complex tokenization, while also making sure that the tokens are valid.
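As a small illustration of the validation aspect, the sketch below parses a comma-separated list of integers and then checks how far the chain of matches got. The input string and the pattern are hypothetical and only serve to show the idea.

```php
<?php
// Hypothetical input: a list of integers with a typo in the third item.
$subject = '10,20,x30,40';

// Each match must start exactly where the previous one ended.
preg_match_all('/\G(\d+)(?:,|$)/', $subject, $matches, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE every entry is [text, offset].
$numbers = array_column($matches[1], 0);

// End position of the last match in the chain (0 if nothing matched).
$last     = end($matches[0]);
$consumed = $last ? $last[1] + strlen($last[0]) : 0;

// The input is valid only if the chain reached the end of the string.
$isValid = ($consumed === strlen($subject));

var_dump($numbers);  // "10" and "20" only
var_dump($consumed); // int(6)
var_dump($isValid);  // bool(false)
```

Without \G the engine would skip the "x" and also return 30 and 40, hiding the fact that the input is malformed.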

Example problem

Let us take the example of tokenizing this input:

input 'some input in quote' more input   '\'escaped quote\''   lots@_$of_fun    ' \' \\  ' crazy'stuff'

Into these tokens (I use ~ to denote end of string):

input~
some input in quote~
more~
input~
'escaped quote'~
lots@_$of_fun~
 ' \  ~
crazy~
stuff~

The string consists of a mix of:

  • Single-quoted strings, in which \ and ' can be escaped and spaces are preserved. An empty string can be written as an empty single-quoted string.
  • OR unquoted strings, which consist of a sequence of non-whitespace characters and do not contain \ or '.
  • A space between two unquoted strings delimits them. No space is needed to delimit the other cases.

For the sake of simplicity, let us assume the input does not contain newlines (in a real case, you need to consider them). Handling them would add to the complexity of the regex without demonstrating the point.

The RAW regex for a single-quoted string is '(?:[^\\']|\\[\\'])*+'
And the RAW regex for an unquoted string is [^\s'\\]++
You don't need to worry too much about these two pieces of regex, though.

The solution below uses \G to guarantee that, when the engine fails to find another match, every character from the beginning of the string up to the end of the last match has been consumed. Since it cannot skip characters, the engine stops matching as soon as the text ahead fits neither token specification, rather than grabbing random stuff from the rest of the string.

Construction

At the first step of construction, we can put together this regex:

\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))

Or simply put (this is not regex - just to make it easier to read):

\G(Singly_quote_regex|Unquoted_regex)

This matches the first token only: on the second attempt, matching fails at the space before 'some input..., and \G does not allow the engine to skip past it.
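This first-step behaviour can be checked in PHP with a short sketch. The ~ delimiters and the nowdoc strings are my own choices, made only to avoid escaping the quotes and backslashes a second time.

```php
<?php
// First-step regex from the text, wrapped in ~ delimiters.
$pattern = <<<'REGEX'
~\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))~
REGEX;

// The example input, kept verbatim in a nowdoc (no escaping, no interpolation).
$input = <<<'TXT'
input 'some input in quote' more input   '\'escaped quote\''   lots@_$of_fun    ' \' \\  ' crazy'stuff'
TXT;

$count = preg_match_all($pattern, $input, $matches);

echo $count, "\n";         // 1
echo $matches[0][0], "\n"; // input
```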


We just need to add a bit to allow for zero or more spaces, so that in each subsequent match the spaces at the position where the last match left off are consumed:

\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))

The regex above will now correctly identify the tokens.
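A sketch of how this regex could be run in PHP is shown below. It collects the raw text of each token: group 1 for quoted tokens (still containing the escape sequences) and group 2 for unquoted ones. The variable names are illustrative.

```php
<?php
$pattern = <<<'REGEX'
~\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))~
REGEX;

$input = <<<'TXT'
input 'some input in quote' more input   '\'escaped quote\''   lots@_$of_fun    ' \' \\  ' crazy'stuff'
TXT;

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$raw = [];
foreach ($matches as $m) {
    // An unquoted token always has at least one character in group 2;
    // otherwise the token was quoted and its content is in group 1.
    $raw[] = (isset($m[2]) && $m[2] !== '') ? $m[2] : ($m[1] ?? '');
}

// Print one token per line, using ~ to mark the end of each token.
foreach ($raw as $token) {
    echo $token, "~\n";
}
```

The quoted tokens still carry their backslashes at this point (for example \'escaped quote\'), so they need an extra unescaping pass to become the desired text.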


The regex can be further modified so that it returns the rest of the string when the engine fails to retrieve any valid token:

\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))

Since the alternatives are tried in order from left to right, the last alternative ((?s).+$) matches if and only if the text ahead does not form a valid single-quoted or unquoted token. This can be used to check for errors.

The first capturing group will contain the text inside a single-quoted string, which needs extra processing to turn into the desired text (it is not really relevant here, so I leave it as an exercise to the readers). The second capturing group will contain the unquoted string. And the third capturing group acts as an indicator that the input string is not valid.
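Putting the three groups together, a tokenizer could look roughly like the sketch below. The function name tokenize(), its return shape, and the use of strtr() for unescaping are my own assumptions, not part of the original answer.

```php
<?php
$pattern = <<<'REGEX'
~\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))~
REGEX;

// Hypothetical helper: returns [tokens, leftover]. leftover is null when
// the whole input was valid, otherwise it holds the text captured by group 3.
function tokenize(string $input, string $pattern): array
{
    $tokens = [];
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (isset($m[3]) && $m[3] !== '') {
            return [$tokens, $m[3]];        // group 3: invalid remainder
        }
        if (isset($m[2]) && $m[2] !== '') {
            $tokens[] = $m[2];              // group 2: unquoted token
        } else {
            // group 1: quoted token; turn \\ into \ and \' into '
            $tokens[] = strtr($m[1] ?? '', ["\\\\" => "\\", "\\'" => "'"]);
        }
    }
    return [$tokens, null];
}

$input = <<<'TXT'
input 'some input in quote' more input   '\'escaped quote\''   lots@_$of_fun    ' \' \\  ' crazy'stuff'
TXT;

[$tokens, $leftover] = tokenize($input, $pattern);
[$partial, $bad]     = tokenize("ok 'unterminated", $pattern);

var_dump($leftover); // NULL: the whole input was tokenized
var_dump($bad);      // string(13) "'unterminated"
```

For the second call, the unterminated quote matches neither token form, so the third group captures the rest of the string and the helper reports it instead of silently dropping it.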


Conclusion

The example above demonstrates one usage scenario for \G: tokenization. There may be other uses that I haven't come across.