Grep -E, Sed -E - low performance when '[x]{1,9999}' is used, but why?

When grep or sed are used with the option --extended-regexp and the pattern {1,9999} is a part of the regexp that is used, the performance of these commands become low. To be more clear, below are applied few tests.[1] [2]

  • The relative performance of grep -E, egrep and sed -E is almost equal, so only the test that were made with grep -E are provided.

Test 1

$ time grep -E '[0-9]{1,99}' < /dev/null

real    0m0.002s

Test 2

$ time grep -E '[0-9]{1,9999}' < /dev/null

> real    0m0.494s

Test 3

$ time grep -E '[0123456789]{1,9999}' < /dev/null

> real    21m43.947s

Test 4

$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null

real    0m0.002s       

What is the reason of this significant difference of the performance?


Solution 1:

Note that it's not the matching that takes time, but the building of the RE. You'll find that it uses quite a lot of RAM as well:

$ valgrind grep -Eo '[0-9]{1,9999}' < /dev/null
==6518== HEAP SUMMARY:
==6518==     in use at exit: 1,603,530,656 bytes in 60,013 blocks
==6518==   total heap usage: 123,613 allocs, 63,600 frees, 1,612,381,621 bytes allocated
$ valgrind grep -Eo '[0-9]{1,99}' < /dev/null
==6578==     in use at exit: 242,028 bytes in 613 blocks
==6578==   total heap usage: 1,459 allocs, 846 frees, 362,387 bytes allocated
$ valgrind grep -Eo '[0-9]{1,999}' < /dev/null
==6594== HEAP SUMMARY:
==6594==     in use at exit: 16,429,496 bytes in 6,013 blocks
==6594==   total heap usage: 12,586 allocs, 6,573 frees, 17,378,572 bytes allocated

The number of allocs seems roughly proportional to the number of iterations, but the memory allocated seems to grow exponentially.

That's down to how GNU regexps are implemented. If you compile GNU grep with CPPFLAGS=-DDEBUG ./configure && make, and run those commands, you'll see the exponential effect in action. Going deeper than that would mean going through a lot of theory on DFA and dive into the gnulib regexp implementation.

Here, you can use PCREs instead that doesn't seem to have the same problem: grep -Po '[0-9]{1,65535}' (the maximum, though you can always do things like [0-9](?:[0-9]{0,10000}){100} for 1 to 1,000,001 repetitions) doesn't take more time nor memory than grep -Po '[0-9]{1,2}'.