Grep -E, Sed -E - low performance when '[x]{1,9999}' is used, but why?
When grep
or sed
are used with the option --extended-regexp
and the pattern {1,9999}
is a part of the regexp that is used, the performance of these commands become low. To be more clear, below are applied few tests.[1] [2]
- The relative performance of
grep -E
,egrep
andsed -E
is almost equal, so only the test that were made withgrep -E
are provided.
Test 1
$ time grep -E '[0-9]{1,99}' < /dev/null
real 0m0.002s
Test 2
$ time grep -E '[0-9]{1,9999}' < /dev/null
> real 0m0.494s
Test 3
$ time grep -E '[0123456789]{1,9999}' < /dev/null > real 21m43.947s
Test 4
$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null
real 0m0.002s
What is the reason of this significant difference of the performance?
Solution 1:
Note that it's not the matching that takes time, but the building of the RE. You'll find that it uses quite a lot of RAM as well:
$ valgrind grep -Eo '[0-9]{1,9999}' < /dev/null
==6518== HEAP SUMMARY:
==6518== in use at exit: 1,603,530,656 bytes in 60,013 blocks
==6518== total heap usage: 123,613 allocs, 63,600 frees, 1,612,381,621 bytes allocated
$ valgrind grep -Eo '[0-9]{1,99}' < /dev/null
==6578== in use at exit: 242,028 bytes in 613 blocks
==6578== total heap usage: 1,459 allocs, 846 frees, 362,387 bytes allocated
$ valgrind grep -Eo '[0-9]{1,999}' < /dev/null
==6594== HEAP SUMMARY:
==6594== in use at exit: 16,429,496 bytes in 6,013 blocks
==6594== total heap usage: 12,586 allocs, 6,573 frees, 17,378,572 bytes allocated
The number of allocs seems roughly proportional to the number of iterations, but the memory allocated seems to grow exponentially.
That's down to how GNU regexps are implemented. If you compile GNU grep
with CPPFLAGS=-DDEBUG ./configure && make
, and run those commands, you'll see the exponential effect in action. Going deeper than that would mean going through a lot of theory on DFA and dive into the gnulib regexp implementation.
Here, you can use PCREs instead that doesn't seem to have the same problem: grep -Po '[0-9]{1,65535}'
(the maximum, though you can always do things like [0-9](?:[0-9]{0,10000}){100}
for 1 to 1,000,001 repetitions) doesn't take more time nor memory than grep -Po '[0-9]{1,2}'
.