Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:

use strict;
use warnings;

my $text = << "END";
int max(int x, int y)
{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}
// more stuff to capture
END

# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
    (?<block>
        \{                # Match opening brace
            (?:           # Start non-capturing group
                [^{}]++   #     Match non-brace characters without backtracking
                |         #     or
                (?&block) #     Recursively match the last captured group
            )*            # Match 0 or more times
        \}                # Match closing brace
    )
)/x;

# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
    print $1;
    print $2;
}

I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.

$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.


Solution 1:

Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:

m/
  (.+?)  # group 1
  (?:  # the $code_block regex
    (?&block)
    (?(DEFINE)
      (?<block> ... )  # group 2
    )
  )
  (.+)  # group 3
/xs

Named groups can also be accessed as numbered groups.

The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.

As a consequence, the text after the code-block will be stored in capture $3.

There are two ways to deal with this problem:

  • For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:

    if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
        print $+{before};
        print $+{afterwards};
    }
    
  • Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.

Solution 2:

There are also ready tools that can be leveraged for this, in a few lines of code.

Perhaps the first module to look at is the core Text::Balanced.

The extract_bracketed in list context returns: matched substring, remainder of the string after the match, and the substring before the match. Then we can keep matching in the remainder

use warnings;
use strict;
use feature 'say';

use Text::Balanced qw/extract_bracketed/;

my $text = 'start {some {stuff} one} and {more {of it} two}, and done';

my ($match, $lead);
while (1) {
    ($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
    say $lead // $text;
    last if not defined $match; 
}

what prints

start 
 and 
, and done

Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.

I've used a made-up string above, but I tested it on your code sample as well.


This can also be done using Regexp::Common.

Break the string using its $RE{balanced} regex, then take odd elements

use Regexp::Common qw(balanced);

my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;

my @out_of_blocks = @parts[  grep { $_ & 1 } 1..$#parts ];

say for @out_of_blocks;

If the string starts with the delimiter the first element is an empty string, as usual with split.

To clean out leading and trailing spaces pass it through map { s/(^\s*|\s*$//gr }.

Solution 3:

You're very close.

(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!

You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)

(?(DEFINE)
  (?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)

This part defines your variables

(?(DEFINE)
  (?<block>\{(?:[^{}]++|(?&block))*\})
)

This part calls your variables into use.

(?&block)

Edits

Edit 1

(?(DEFINE)
  (?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)

The regex above will get the comment after a block (as you've already defined).

You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)

Edit 2

(?(DEFINE)
  (?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)

This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.

Edit 3

As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.

Edit 4

I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it, therefore, the only true solution is to capture it. There is, however, a hack-around alternative that works for your situation. If you want to grab the first and last sections without the second section you can use the following regex, which, will not check the second section of your regex for proper syntax (downside). If you do need to check the syntax you're going to have to deal with there being an additional capture group.

(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)

This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })