How can I match nested brackets using regex?

Solution 1:

Many regex implementations will not allow you to match an arbitrary amount of nesting. However, Perl, PHP and .NET support recursive patterns.

A demo in Perl:

#!/usr/bin/perl -w

my $text = '(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)';

while($text =~ /(\(([^()]|(?R))*\))/g) {
  print("----------\n$1\n");
}

which will print:

----------
(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
----------
(outer
   (inner)
 ouer)
----------
(outer
 ouer)

Or, the PHP equivalent:

$text = '(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)';

preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);

print_r($matches);

which produces:

Array
(
    [0] => Array
        (
            [0] => (outer
   (center
     (inner)
     (inner)
   center)
 ouer)
            [1] => (outer
   (inner)
 ouer)
            [2] => (outer
 ouer)
        )

...

An explanation:

(          # start group 1
  \(       #   match a literal '('
  (        #   group 2
    [^()]  #     any char other than '(' and ')'
    |      #     OR
    (?R)   #     recursively match the entir pattern
  )*       #   end group 2 and repeat zero or more times
  \)       #   match a literal ')'
)          # end group 1

EDIT

Note @Goozak's comment:

A better pattern might be \(((?>[^()]+)|(?R))*\) (from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.

Solution 2:

Don't use regex.

Instead, a simple recursive function will suffice. Here's the general structure:

def recursive_bracket_parser(s, i):
    while i < len(s):
        if s[i] == '(':
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == ')':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i

For example, here's a function that will parse the input into a nested list structure:

def parse_to_list(s, i=0):
    result = []
    while i < len(s):
        if s[i] == '(':
            i, r = parse_to_list(s, i+1)
            result.append(r)
        elif s[i] == ')':
            return i+1, result
        else:
            result.append(s[i])
            i += 1
    return i, result

Calling this like parse_to_list('((a) ((b)) ((c)(d)))efg') produces the result [[['a'], ' ', [['b']], ' ', [['c'], ['d']]], 'e', 'f', 'g'].

Solution 3:

I found this simple regex which extracts all nested balanced groups using recursion, although the resulting solution is not quite straightforward as you may expect:

Regex pattern: (1(?:\1??[^1]*?2))+

Sample input: 1ab1cd1ef2221ab1cd1ef222

For simplicity I put 1 for open and 2 for closed bracket. Alpha characters represent some inner data. I'll rewrite input so that it can be easy to explain.

1  ab  1 cd 1ef2 2  2     1  ab  1 cd 1ef2 2  2

            |_1|
       |______2__|
|_____________3_____|

In first iteration regex will match the most inner subgroup 1ef2 of in first sibling group 1ab1cd1ef222. If we remember it and it's position, and remove this group, there would remain 1ab1cd22. If we continue with regex, it would return 1cd2, and finally 1ab2. Then, it will continue to parse second sibling group the same way.

As we see from this example, regex will properly extract substrings as they appear in the hierarchy defined by brackets. Position of particularly substring in hierarchy will be determined during second iteration, if it's position in string is between substring from second iteration, then it is a child node, else it's a sibling node.

From our example:

  1. 1ab1cd1ef222 1ab1cd1ef222, iteration match 1ef2, with index 6,

  2. 1ab1cd22 1ab1cd1ef222, iteration match 1cd2, with index 3, ending with 6. Because 3 < 6 <= 6, first substring is child of the second substring.

  3. 1ab2 1ab1cd1ef222, iteration match 1ab2, with index 0, ending with 3. Because 0 < 3 <= 3, first substring is child of the second substring.

  4. 1ab1cd1ef222, iteration match 1ef2, with index 6, Because it's not 3 < 0 <= 6, it's branch from another sibling, etc...

We must iterate and remove all siblings before we can move to the parent. Thus, we must remember all that siblings in the order they appear in iteration.