Catastrophic backtracking shouldn't be happening on this regex

Can someone explain why Java's regex engine goes into catastrophic backtracking mode on this regex? Every alternation is mutually exclusive with every other alternation, from what I can tell.

^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
\"(?:[^\"]+|\"\")+\"|
'(?:[^']+|'')+')

Text: 'pão de açúcar itaucard mastercard platinum SUSTENTABILIDADE])

Adding possessive matching to some of the alternations fixes the problem, but I have no idea why - Java's regex lib must be extremely buggy to backtrack on mutually exclusive branches.

 ^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
 \"(?:[^\"]++|\"\")++\"|
 '(?:[^']++|'')++')

EDIT: Added Java version at end — despite being inherently clumsy, unreadable, and unmaintainable.


¡No More Ugly Patterns!

The first thing you need to do is to write your regex in a way that can stand any possible hope of being readable — and consequently maintainable — by human beings. The second thing you need to do is profile this to see what it is actually doing.

That means at a minimum you need to compile it in Pattern.COMMENTS mode (or prefix "(?x)") and then add whitespace to give some visual elbow room. As near as I can make it out, the pattern that you are actually trying to match against is this one:

^ 
(?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted
    [^"\s~:/@\#|^&\[\]()\\{}] *
  | " (?: [^"]+ | "" )+ "        # alt 2: double quoted
  | ' (?: [^']+ | '' )+ '        # alt 3: single quoted
)

As you see, I’ve introduced both vertical and horizontal whitespace in the places where I could, in order to guide the eye and mind as a sort of cognitive chunking. I’ve also removed all your extraneous backslashes. Those are all either outright errors or else obfuscators that do nothing but confuse the reader.

Notice how when applying vertical whitespace, I’ve made the parts that are the same from one line to the next occur at the same column so that you can immediately see which parts are the same and which parts are different.

Having done that, I can finally see that what you here is a match anchored to the start which is then followed by a choice of three alternatives. I’ve therefore labelled those three alternatives with a descriptive comment so one doesn't have to guess.

I also notice that your first alternative has two subtly different (negated) square‐bracketed character classes. The second one is lacking the single quote exclusion seen in the first one. Is this intentional? Even if it is, I find this far too much duplication for my tastes; some or all of that should be in a variable so that you don’t risk update coherency issues.


Profiling

The second and more important of the two things that you have to do is to profile this. You need to see exactly what regex program that that pattern is being compiled into, and you need to trace its execution as it runs over your data.

Java’s Pattern class is not currently capable of doing this, although I have spoken at some length about it with the current code custodian at OraSun, and he is both keen to add this capability to Java and thinks he knows exactly how to do it. He even sent me a prototype that does the first part: the compilation. So I expect it to be available one of these days.

Meanwhile, let us turn instead to a tool in which regexes are an integral part of the programming language proper, not something bolted on as an awkward afterthought. Although several languages meet that criterion, in none has the art of pattern matching reached the level of sophistication seen in Perl.

Here then is an equivalent program.

#!/usr/bin/env perl
use v5.10;      # first release with possessive matches
use utf8;       # we have utf8 literals
use strict;     # require variable declarations, etc
use warnings;   # check for boneheadedness

my $match = qr{
    ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}]
          [^"\s~:/@\#|^&\[\]()\\{}] *
        | " (?: [^"]+ | "" )+ "
        | ' (?: [^']+ | '' )+ '
    )
}x;

my $text = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";

my $count = 0;

while ($text =~ /$match/g) {
    print "Got it: $&\n";
    $count++;
}

if ($count == 0) {
    print "Match failed.\n";
}

If we run that program, we get the expected answer that the match failed. The question is why and how.

We now want to look at two things: we want to see what regex program that pattern compiles into, and we then want to trace the execution of that regex program.

Both of these are controlled with the

use re "debug";

pragma, which can also be specified on the command line via -Mre=debug. This is what we’ll do here to avoid hacking on the source code.

Regex Compilation

The re debugging pragma will normally show both the compilation of the pattern as well as its execution. To separate those, we can use Perl’s “compile only” switch, -c, which does not attempt to execution the program it has compiled. That way all we look at is the compiled pattern. Do these produces the following 36 lines of output:

$ perl -c -Mre=debug /tmp/bt
Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (79)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (79)
        <"> (29)
  29:     CURLYX[0] {1,32767} (49)
  31:       BRANCH (44)
  32:         PLUS (48)
  33:           ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  44:       BRANCH (FAIL)
  45:         EXACT <""> (48)
  47:       TAIL (48)
  48:     WHILEM[1/2] (0)
  49:     NOTHING (50)
  50:     EXACT <"> (79)
        <'> (55)
  55:     CURLYX[0] {1,32767} (75)
  57:       BRANCH (70)
  58:         PLUS (74)
  59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  70:       BRANCH (FAIL)
  71:         EXACT <''> (74)
  73:       TAIL (74)
  74:     WHILEM[2/2] (0)
  75:     NOTHING (76)
  76:     EXACT <'> (79)
  78: TAIL (79)
  79: END (0)
anchored(BOL) minlen 1 
/tmp/bt syntax OK
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

As you see, the compiled regex program is in something of a “regex assembly language” of its own. (It also happens to look very much like what the Java prototype that was demonstrated to me spits out, so I imagine you’ll one day see this sort of thing in Java, too.) All the details aren’t essential to this, but I will point out that the instruction at node 2 is a BRANCH, which if it fails goes on to execute branch 26, another BRANCH. That second BRANCH, which is the only other part of the regex program, consists of a single TRIE-EXACT node, because it knows that the alternatives have different starting literal strings. What happens w ithin those two trie branches we will discuss presently.

Regex Execution

Now it’s time to see what happens when it runs. The text string you are using causes quite a fair amount of backtracking, which means you would have a lot of output to wade through before it finally fails. How much output? Well, this much:

$ perl -Mre=debug /tmp/bt 2>&1 | wc -l
9987

I assume that 10,000 steps is what you meant by “catastrophic backtracking mode”. Let’s see that we can’t whittle that down into something more comprehensible. Your input string is 61 characters in length. To better see what is happening, we can trim it down to just 'pão, which is only 4 characters. (Well, in NFC, that is; it’s five code points in NFD, but that changes nothing here). That results in 167 lines of output:

$ perl -Mre=debug /tmp/bt 2>&1 | wc -l
167

In fact, here are the lines of regex (compilation plus) execution profiling that you get when your string is this many characters long:

chars   lines   string
    1     63   ‹'›
    2     78   ‹'p›  
    3    109   ‹'pã›
    4    167   ‹'pão› 
    5    290   ‹'pão ›
    6    389   ‹'pão d›
    7    487   ‹'pão de›
    8    546   ‹'pão de ›
    9    615   ‹'pão de a›
   10    722   ‹'pão de aç›
  ....
   61   9987   ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])›

Let’s look at the debugging output when the string is the four characters 'pão. I’ve omitted the compilation part this time to show only the execution part:

$ perl -Mre=debug /tmp/bt
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o"
UTF-8 string...
   0 <> <'p%x{e3}o>  |  1:BOL(2)
   0 <> <'p%x{e3}o>  |  2:BRANCH(26)
   0 <> <'p%x{e3}o>  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                            failed...
   0 <> <'p%x{e3}o>  | 26:BRANCH(78)
   0 <> <'p%x{e3}o>  | 27:  TRIE-EXACT["'](79)
   0 <> <'p%x{e3}o>  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o>  |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                            got 1 possible matches
                            TRIE matched word #2, continuing
                            only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o>  | 55:  CURLYX[0] {1,32767}(75)
   1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                              whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o>  | 57:      BRANCH(70)   1 <'> <p%x{e3}o>          | 58:        PLUS(74)
                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                        failed...
   5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
   5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
                                        failed...
   4 <'p%x{e3}> <o>  | 70:            BRANCH(73)
   4 <'p%x{e3}> <o>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   4 <'p%x{e3}> <o>  | 75:            NOTHING(76)
   4 <'p%x{e3}> <o>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   2 <'p> <%x{e3}o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   2 <'p> <%x{e3}o>  | 57:            BRANCH(70)
   2 <'p> <%x{e3}o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 2 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
   5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
   4 <'p%x{e3}> <o>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:                  BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                      WHILEM[2/2](0)
                                                whilem: matched 3 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                        BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                          PLUS(74)
                                                    ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647.
..
                                                    failed...
   5 <'p%x{e3}o> <>  | 70:                        BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                          EXACT <''>(74)
                                                    failed...
                                                  BRANCH failed...
                                                whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                        NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                        EXACT <'>(79)
                                                  failed...
                                                failed...
                                              failed...
   4 <'p%x{e3}> <o>  | 70:                  BRANCH(73)
   4 <'p%x{e3}> <o>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   4 <'p%x{e3}> <o>  | 75:                  NOTHING(76)
   4 <'p%x{e3}> <o>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
                                        failed...
   2 <'p> <%x{e3}o>  | 70:            BRANCH(73)
   2 <'p> <%x{e3}o>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   2 <'p> <%x{e3}o>  | 75:            NOTHING(76)
   2 <'p> <%x{e3}o>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
                                  failed...
   1 <'> <p%x{e3}o>  | 70:      BRANCH(73)
   1 <'> <p%x{e3}o>  | 71:        EXACT <''>(74)
                                  failed...
                                BRANCH failed...
                              failed...
                            failed...
                          BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

What you see happening there is that the trie quickly branches down to node 55, which is the last of your three alternatives after it has matched the single quote, because your target string starts with a single quote. That is this one:

  | ' (?: [^']+ | '' )+ '        # alt 3: single quoted

Node 55 is this trie branch:

        <'> (55)
  55:     CURLYX[0] {1,32767} (75)
  57:       BRANCH (70)
  58:         PLUS (74)
  59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  70:       BRANCH (FAIL)
  71:         EXACT <''> (74)

And here is the execution trace showing where your catastrophic backoff is happening:

   1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                              whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o>  | 57:      BRANCH(70)
   1 <'> <p%x{e3}o>  | 58:        PLUS(74)
                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                        failed...
   5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767

Node 58 gobbled up all 3 remaining characters in the string, pão. That caused the terminating exact match of a single quote to fail. It therefore attempts your alternative, which is a pair of single quotes, and that also fails.

It is at this point I have to question your pattern. Shouldn’t

' (?: [^']+ | '' )+ '

really simply be this?

' [^']* '

So what is going on is that there are lot of ways for it to backtrack looking for something that can logically never occur. You have a nested quantifier, and that is causing all kinds of hopeless and mindless busywork.

If we swap simplify the pattern into this:

^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +
    | " [^"]* "
    | ' [^']* '
  )

It now gives the same number of trace output lines no matter the size of the input string: only 40, and that includes the compilation. Witness both compilation and execution on the full string:

Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (61)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (61)
        <"> (29)
  29:     STAR (41)
  30:       ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  41:     EXACT <"> (61)
        <'> (46)
  46:     STAR (58)
  47:       ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  58:     EXACT <'> (61)
  60: TAIL (61)
  61: END (0)
anchored(BOL) minlen 1 
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mast
ercard platinum S"...
UTF-8 string...
   0 <> <'p%x{e3}o >  |  1:BOL(2)
   0 <> <'p%x{e3}o >  |  2:BRANCH(26)
   0 <> <'p%x{e3}o >  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                             failed...
   0 <> <'p%x{e3}o >  | 26:BRANCH(60)
   0 <> <'p%x{e3}o >  | 27:  TRIE-EXACT["'](61)
   0 <> <'p%x{e3}o >  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o d> |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                             got 1 possible matches
                             TRIE matched word #2, continuing
                             only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o d> | 46:  STAR(58)
                             ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
                             failed...
                           BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

I know you were thinking that possessive matching might be the answer here, but I think the real problem is the faulty logic in the original pattern. See how more sanely this runs now?

If we run it with your possessives on the old pattern, even though I don’t think that makes sense, we still get constant runtime, but it takes more steps. With this pattern

   ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +    # alt 1: unquoted
       | " (?: [^"]++ | "" )++ "        # alt 2: double quoted
       | ' (?: [^']++ | '' )++ '        # alt 3: single quoted
     )

The compilation plus execution profile is as follows:

Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (95)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (95)
        <"> (29)
  29:     SUSPEND (58)
  31:       CURLYX[0] {1,32767} (55)
  33:         BRANCH (50)
  34:           SUSPEND (54)
  36:             PLUS (48)
  37:               ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  48:             SUCCEED (0)
  49:           TAIL (53)
  50:         BRANCH (FAIL)
  51:           EXACT <""> (54)
  53:         TAIL (54)
  54:       WHILEM[1/2] (0)
  55:       NOTHING (56)
  56:       SUCCEED (0)
  57:     TAIL (58)
  58:     EXACT <"> (95)
        <'> (63)
  63:     SUSPEND (92)
  65:       CURLYX[0] {1,32767} (89)
  67:         BRANCH (84)
  68:           SUSPEND (88)
  70:             PLUS (82)
  71:               ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  82:             SUCCEED (0)
  83:           TAIL (87)
  84:         BRANCH (FAIL)
  85:           EXACT <''> (88)
  87:         TAIL (88)
  88:       WHILEM[2/2] (0)
  89:       NOTHING (90)
  90:       SUCCEED (0)
  91:     TAIL (92)
  92:     EXACT <'> (95)
  94: TAIL (95)
  95: END (0)
anchored(BOL) minlen 1 
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mastercard platinum S"...
UTF-8 string...
   0 <> <'p%x{e3}o > |  1:BOL(2)
   0 <> <'p%x{e3}o > |  2:BRANCH(26)
   0 <> <'p%x{e3}o > |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                            failed...
   0 <> <'p%x{e3}o > | 26:BRANCH(94)
   0 <> <'p%x{e3}o > | 27:  TRIE-EXACT["'](95)
   0 <> <'p%x{e3}o > |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o d>|      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                            got 1 possible matches
                            TRIE matched word #2, continuing
                            only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o d>| 63:  SUSPEND(92)
   1 <'> <p%x{e3}o d>| 65:    CURLYX[0] {1,32767}(89)
   1 <'> <p%x{e3}o d>| 88:      WHILEM[2/2](0)
                                whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o d>| 67:        BRANCH(84)
   1 <'> <p%x{e3}o d>| 68:          SUSPEND(88)
   1 <'> <p%x{e3}o d>| 70:            PLUS(82)
                                      ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
  64 <NTABILIDAD])> <| 82:              SUCCEED(0)
                                        subpattern success...
  64 <NTABILIDAD])> <| 88:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
  64 <NTABILIDAD])> <| 67:            BRANCH(84)
  64 <NTABILIDAD])> <| 68:              SUSPEND(88)
  64 <NTABILIDAD])> <| 70:                PLUS(82)
                                          ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                          failed...
                                        failed...
  64 <NTABILIDAD])> <| 84:            BRANCH(87)
  64 <NTABILIDAD])> <| 85:              EXACT <''>(88)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
  64 <NTABILIDAD])> <| 89:            NOTHING(90)
  64 <NTABILIDAD])> <| 90:            SUCCEED(0)
                                      subpattern success...
  64 <NTABILIDAD])> <| 92:  EXACT <'>(95)
                failed...
              BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

I still like my solution better. It’s shorter.


EDIT

It appears that the Java version really is taking 100 times more steps than the Perl version of an identical pattern, and I have no idea why — other than that the Perl regex compiler is about 100 times smarter at optimizations than the Java regex compiler, which doesn’t ever do any to speak of, and should.

Here’s the equivalent Java program. I’ve removed the leading anchor so that we can loop properly.

$ cat java.crap
import java.util.regex.*;

public class crap {

public static void
main(String[ ] argv) {
    String input = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";
    String regex = "\n"
                + "(?: [^'\"\\s~:/@\\#|^&\\[\\]()\\\\{}]    # alt 1: unquoted         \n"
                + "    [^\"\\s~:/@\\#|^&\\[\\]()\\\\{}] *                     \n"
                + "  | \" (?: [^\"]++ | \"\" )++ \"       # alt 2: double quoted   \n"
                + "  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   \n"
                + ")                                                          \n"
                ;
    System.out.printf("Matching ‹%s› =~ qr{%s}x\n\n", input, regex);

    Pattern regcomp = Pattern.compile(regex, Pattern.COMMENTS);
    Matcher regexec = regcomp.matcher(input);

    int count;
    for (count = 0; regexec.find(); count++) {
       System.out.printf("Found match: ‹%s›\n", regexec.group());
    }
    if (count == 0) {
        System.out.printf("Match failed.\n");
    }
  }
}

When run, that produces this:

$ javac -encoding UTF-8 crap.java && java -Dfile.encoding=UTF-8 crap
Matching ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])› =~ qr{
(?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted         
    [^"\s~:/@\#|^&\[\]()\\{}] *                     
  | " (?: [^"]++ | "" )++ "       # alt 2: double quoted   
  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   
)                                                          
}x

Found match: ‹pão›
Found match: ‹de›
Found match: ‹açúcar›
Found match: ‹itaucard›
Found match: ‹mastercard›
Found match: ‹platinum›
Found match: ‹SUSTENTABILIDAD›

As you can plainly see, pattern matching in Java has quite a lot to be said about it, absolutely none of which would pass the potty-mouth police. It’s simply a royal pain in the ass.


I have to admit this surprised me, too, but I get the same result in RegexBuddy: it quits trying after a million steps. I know the warnings about catastrophic backtracking tend to focus on nested quantifiers, but in my experience alternation is at least as dangerous. In fact, if I change the last part of your regex from this:

'(?:[^']+|'')+'

...to this:

'(?:[^']*(?:''[^']*)*)'

...it fails in only eleven steps. This is an example of Friedl's "unrolled loop" technique, which he breaks down like this:

opening normal * ( special normal * ) * closing
   '     [^']        ''     [^']           '

The nested stars are safe as long as:

  1. special and normal can never match the same thing,
  2. special always matches at least one character, and
  3. special is atomic (there must be only one way for it to match).

The regex will then fail to match with minimal backtracking, and succeed with no backtracking at all. The alternation version, on the other hand, is almost guaranteed to backtrack, and where no match is possible it quickly spirals out of control as the length of the target string increases. If it doesn't backtrack excessively in some flavors, it's because they have optimizations built in specifically to counter this problem--something very few flavors do, so far.