Why is `\s+` so much faster than `\s\s*` in this Perl regex?

First, even if the resulting regex will not keep the same meaning, let's reduces regexes to \s*0 and \s+0 and use (" " x 4) . "_0" as an input. For the sceptics, you can see here that the lag is still present.

Now let's consider the following code:

$x = (" " x 4) . "_ 0";
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

Digging a bit with use re debugcolor; we get the following output:

Guessing start of match in sv for REx "\s*0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
   0 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
   1 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 3 times out of 2147483647...
                                  failed...
   2 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 2 times out of 2147483647...
                                  failed...
   3 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 1 times out of 2147483647...
                                  failed...
   5 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 0 times out of 2147483647...
   5 <    _0>|  3:  EXACT <0>(5)
   6 <    _0>|  5:  END(0)
Match successful!

-----------------------

Guessing start of match in sv for REx "\s+0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes)
   0 <    _0>|  1:PLUS(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed

Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it'll look for 0 : Found floating substr "0" at offset 5...

Then it'll start with the variable part of the regex, respectivly \s* and \s+, against the whole minimum string to check:

Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

After that it'll look for the first position meeting the stclass requirement, here at position 0.

  • \s*0:
    • starts at 0, find 4 spaces then fail;
    • starts at 1, find 3 spaces then fail;
    • starts at 2, find 2 spaces then fail;
    • starts at 3, find 1 spaces then fail;
    • starts at 4, find 0 spaces then doesn't fail;
    • Find an exact 0
  • \s+0:
    • starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.

If you want to have fun with Perl regex optimization, you can consider the following regexes / *\n and / * \n. At first glance, they look the same, have the same meaning... But if you run its against (" " x 40000) . "_\n" the first one will check all possibilities while the second one will look for " \n" and fail immediately.

In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn't fail with Perl because it have been optimized to find floating substr "0%n"


You can see another example on Jeff Atwood's blog.

Note also that the issue is not about \s consideration but any pattern where xx* is used instead of x+ see example with 0s and also regex explosive quantifiers

With such shorter example the behavior is "findable", but if you start to play with complicated patterns, it's far from easy to spot, for example: Regular expression hangs program (100% CPU usage)


When there is a "plus" node (e.g. \s+) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with \s*, on the other hand, the engine only advances one character at a time.

Yves Orton explains this optimization nicely here:

The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence.

Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.

/\s+/ triggers flip-flop mode; /\s*/, /\s\s*/, and /\s\s+/ don't. This optimization can't be applied to "star" nodes like \s* because they can match zero characters, so a failure at one point in a sequence isn't indicative of failure later in the same sequence.


You can see this in the debug output for each regex. I've highlighted the skipped characters with ^. Compare this (skips four characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'
   ...
   0 <> <123 456 78>         |  1:PLUS(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:PLUS(3)
      ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:PLUS(3)
         ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...

to this (skips one or two characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'
   ...
   0 <> <123 456 78>         |  1:STAR(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   1 <1> <23 456 789>        |  1:STAR(3)
      ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   2 <12> <3 456 789 >       |  1:STAR(3)
       ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:STAR(3)
        ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   5 <123 4> <56 789 x>      |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   6 <23 45> <6 789 x>       |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:STAR(3)
           ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   9 <23 456 7> <89 x>       |  1:STAR(3)
             ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
  10 <23 456 78> <9 x>       |  1:STAR(3)
              ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
  12 <23 456 789 > <x>       |  1:STAR(3)
               ^^
                                  POSIXD[\d] can match 0 times out of 2147483647...
  12 <23 456 789 > <x>       |  3:  EXACT <x>(5)
  13 <23 456 789 x> <>       |  5:  END(0)

Note that the optimization isn't applied to /\s\s+/, because \s+ isn't at the beginning of the pattern. Both /\s\s+/ (logically equivalent to /\s{2,}/) and /\s\s*/ (logically equivalent to /\s+/) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort.


In case you're interested, "flip-flop mode" is enabled by setting the PREGf_SKIP flag on a regex when it's compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.


The \s+\n requires that the character preceding the \n is a SPACE.

According to use re qw(debug) the compilation establishes that it needs a straight string of a known number of spaces, up to the substring \n, which is first checked for in the input. Then it checks the fixed-length space-only substring against the remaining part of input, failing as it comes to _. It's a single possibility to check, regardless of how many spaces the input has. (When there are more _\n each is found to fail equally directly, per debug output.)

Looked at it this way, it's an optimization you'd almost expect, utilizing a rather specific search pattern and lucking out with this input. Except when taken in comparison with other engines, which clearly don't do this kind of analysis.

With \s*\n this is not the case. Once the \n is found and the previous character is not a space, the search hasn't failed since \s* allows nothing (zero characters). There are no fixed-length substrings either, and it's in the backtracking game.