Capturing Quantifiers and Quantifier Arithmetic

Solution 1:

I don't know of a regex engine that can capture a quantifier. However, with PCRE or Perl you can use a few tricks to check that several runs of characters have the same length. With your example:

@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

you can check that the runs of @, =, - and / are balanced with the following pattern. It uses the famous Qtax trick (are you ready?), the "possessive-optional self-referencing group":
~(?<!@)((?:@(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!@)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~

pattern details (free-spacing layout; running it as written requires the x modifier):

~                          # pattern delimiter
(?<!@)                     # negative lookbehind used as an @ boundary
(                          # first capturing group for the @
    (?:
        @                  # one @
        (?=                # checks that each @ is followed by the same number
                           # of = - /  
            [^=]*          # all that is not an =
            (\2?+=)        # The possessive optional self-referencing group:
                           # capture group 2: backreference to itself + one = 
            [^-]*(\3?+-)   # the same for -
            [^/]*(\4?+/)   # the same for /
        )                  # close the lookahead
    )+                     # close the non-capturing group and repeat
)                          # close the first capturing group
(?!@)                      # negative lookahead used as an @ boundary too.

# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~
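
To make the goal concrete, here is a minimal plain-Python sketch (not a regex) of the property the pattern verifies. The function names are hypothetical, and the sketch makes a simplifying assumption: it only looks at the first run of each character, taken in the order @ = - /, whereas the regex can also find a balanced group later in the string.

```python
SAMPLE = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'


def first_run_length(text, char, start=0):
    """Return (length, end_index) of the first run of `char` at or after `start`."""
    begin = text.find(char, start)
    if begin == -1:
        return 0, start
    end = begin
    while end < len(text) and text[end] == char:
        end += 1
    return end - begin, end


def runs_are_balanced(text, chars="@=-/"):
    """True when the first runs of each character, in order, have equal length."""
    lengths = []
    pos = 0
    for char in chars:
        length, pos = first_run_length(text, char, pos)
        lengths.append(length)
    # At least one '@' is required, and every other run must be as long.
    return lengths[0] > 0 and all(n == lengths[0] for n in lengths)
```

This is handy as a reference check when testing the regex against sample strings.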

The main idea

The non-capturing group contains only one @. Each time this group is repeated, one more character is added to each of capture groups 2, 3 and 4.
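
The growth of the captures can be traced with a small plain-Python imitation of the loop. This is only a sketch of the idea, not how a regex engine works internally; `grow_captures` and its parameters are hypothetical names.

```python
SAMPLE = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'


def grow_captures(text, opener="@", closers="=-/"):
    """Return one snapshot of the captures per opener that could be matched."""
    pos = text.find(opener)
    if pos == -1:
        return []
    captures = {c: "" for c in closers}
    trace = []
    while pos < len(text) and text[pos] == opener:
        cursor = pos + 1
        for c in closers:
            run_start = text.find(c, cursor)   # like [^=]* : skip to the first closer
            grown = captures[c] + c            # like (\2?+=) : old capture plus one more
            if run_start == -1 or not text.startswith(grown, run_start):
                return trace                   # the lookahead fails, the + loop stops
            captures[c] = grown
            cursor = run_start + len(grown)
        trace.append(dict(captures))
        pos += 1
    return trace


for step in grow_captures(SAMPLE):
    print(step)
```

Each printed snapshot corresponds to one repetition of the non-capturing group: every capture is one character longer than in the previous snapshot.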

The possessive-optional self-referencing group

How does it work?

( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )

At the first occurrence of the @ character, capture group 2 is not yet defined, so you cannot simply write (\2 =): a backreference to an unset group fails and makes the whole pattern fail. The way around the problem is to make the backreference optional: \2?

The second aspect of this group is that the number of = characters matched grows by one at each repetition of the non-capturing group, since one = is added each time. To guarantee that this number always increases (or else the pattern fails), the optional backreference is made possessive: once group 2 is set, \2?+ must match its previous content first and cannot give it back through backtracking before the new = is added.

Note that this group can be read as a conditional: if group 2 exists, match it, then match the next =

( (?(2)\2) = )

The recursive way

~(?<!@)(?=(@(?>[^@=]+|(?-1))*=)(?!=))(?=(@(?>[^@-]+|(?-1))*-)(?!-))(?=(@(?>[^@/]+|(?-1))*/)(?!/))~

You need overlapping matches here, because the @ part is used several times. That is why the whole pattern sits inside lookarounds.

pattern details (free-spacing layout; running it as written requires the x modifier):

(?<!@)                # left @ boundary
(?=                   # open a lookahead (to allow overlapped matches)
    (                 # open a capturing group
        @
        (?>           # open an atomic group
            [^@=]+    # all that is not an @ or an =, one or more times
          |           # OR
            (?-1)     # recursion into the most recently opened capturing group
                      # (here, the current one)
        )*            # repeat the atomic group zero or more times
        =             # one literal =
    )                 # close the capture group
    (?!=)             # checks the = boundary
)                     # close the lookahead
(?=(@(?>[^@-]+|(?-1))*-)(?!-))  # the same for -
(?=(@(?>[^@/]+|(?-1))*/)(?!/))  # the same for /

The main difference from the previous pattern is that this one doesn't care about the order of the =, - and / groups. (However, you can easily adapt the first pattern to deal with that, using character classes and negative lookaheads.)

Note: for the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want the whole string as the match result, you must add .* at the end (otherwise the match result will be empty, as playful points out).
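
The recursion treats each @ as an opening bracket and each = (or - or /) as a closing bracket. A rough plain-Python imitation with a depth counter, assuming hypothetical function names and ignoring engine details such as atomic grouping, could look like this:

```python
SAMPLE = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'


def nested_span_end(text, start, opener, closer):
    """Return the index just past the balanced opener...closer span, or -1."""
    if start >= len(text) or text[start] != opener:
        return -1
    depth = 0
    for i in range(start, len(text)):
        if text[i] == opener:
            depth += 1          # entering one more level of recursion
        elif text[i] == closer:
            depth -= 1          # one level of recursion is closed
            if depth == 0:
                return i + 1
    return -1                   # not enough closers for the openers


def balanced_at(text, start=0, opener="@", closers="=-/"):
    """Imitate the three lookaheads, each one tested from the same position."""
    if start > 0 and text[start - 1] == opener:       # like (?<!@)
        return False
    for closer in closers:
        end = nested_span_end(text, start, opener, closer)
        if end == -1 or text[end:end + 1] == closer:  # like (?!=)
            return False
    return True
```

Because every closer is checked independently from the same starting position, the order of the =, - and / runs does not matter here, which mirrors the behaviour described above.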

Solution 2:

Coming back five weeks later because I learned that .NET has something that comes very close to the idea of "quantifier capture" mentioned in the question. The feature is called "balancing groups".

Here is the solution I came up with. It looks long, but it is quite simple.

(?:@(?<c1>)(?<c2>)(?<c3>))+[^@=]+(?<-c1>=)+[^=-]+(?<-c2>-)+[^-/]+(?<-c3>/)+[^/]+(?(c1)(?!))(?(c2)(?!))(?(c3)(?!))

How does it work?

  1. The first non-capturing group matches the @ characters. In that non-capturing group, we have three named groups c1, c2 and c3 that don't match anything, or rather, that match an empty string. These groups will serve as three counters c1, c2 and c3. Because .NET keeps track of intermediate captures when a group is quantified, every time an @ is matched, a capture is added to the capture collections for Groups c1, c2 and c3.

  2. Next, [^@=]+ eats up all the characters up to the first =.

  3. The second quantified group (?<-c1>=)+ matches the = characters. That group seems to be named -c1, but -c1 is not a group name. -c1 is .NET syntax to pop one capture from the c1 group's capture collection into the ether. In other words, it allows us to decrement c1. If you try to decrement c1 when the capture collection is empty, the match fails. This ensures that we can never have more = than @ characters. (Later, we'll have to make sure that we cannot have more @ than = characters.)

  4. The next steps repeat steps 2 and 3 for the - and / characters, decrementing counters c2 and c3.

  5. The [^/]+ eats up the rest of the string.

  6. The (?(c1)(?!)) is a conditional that says "If group c1 has been set, then fail". You may know that (?!) is a common trick to force a regex to fail. This conditional ensures that c1 has been decremented all the way to zero: in other words, there cannot be more @ than = characters.

  7. Likewise, the (?(c2)(?!)) and (?(c3)(?!)) ensure that there cannot be more @ than - and / characters.
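
The seven steps above can be imitated in plain Python with three stacks standing in for the capture collections of c1, c2 and c3. This is a simplified sketch, not .NET code: the function name is hypothetical, the string is assumed to start with the @ run, and the filler requirements of the real pattern (at least one other character between the runs and at the end) are not enforced.

```python
SAMPLE = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'


def balanced_with_counters(text):
    """Push one capture per '@', pop one per '=', '-' and '/', then check for zero."""
    stacks = {"=": [], "-": [], "/": []}   # stand-ins for c1, c2 and c3
    pos = 0
    # Step 1: every '@' adds an empty capture to each collection.
    while pos < len(text) and text[pos] == "@":
        for stack in stacks.values():
            stack.append("")
        pos += 1
    if pos == 0:
        return False                        # the + needs at least one '@'
    # Steps 2 to 4: skip to each run in turn and pop once per character.
    for closer, stack in stacks.items():
        pos = text.find(closer, pos)
        if pos == -1:
            return False
        while pos < len(text) and text[pos] == closer:
            if not stack:
                return False                # popping an empty collection fails
            stack.pop()
            pos += 1
    # Steps 6 and 7: any capture left over means there were more '@' than closers.
    return not any(stacks.values())
```

Popping from an empty stack rejects too many closing characters, and the final emptiness check rejects too few, which is the same two-sided guarantee the conditionals give the regex.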

I don't know about you, but even though this is a bit long, I find it really intuitive.