Why is this regular expression so slow in Java? [duplicate]
Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")
There are lots of ways that lots of a
s could match:
(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.
For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab"
, it will greedily match all of those a
s in a single pass, match the b
, job done.
For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"
, when it gets to the end and finds that the string doesn't match (because of the s
), it's not correctly recognizing that the s
means it can never match. So, having gone through and likely matched as
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs
it thinks "Oh, maybe it failed because of the way I grouped the a
s - and goes back and tries all the other combinations of the a
s.
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs // ...
...
There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 a
s, because each a
can either be part of the previous group, or start its own group), so it takes a long time.
I don't know Perl too well but the Python version is not equivalent to the Java one. You are using search()
but the Java version is using matches()
. The equivalent method in Python would be fullmatch()
When I run your examples in Python (3.8.2) with search()
I get quick results as you do. When I run it with fullmatch()
I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match?
BTW: if you want to try the Java version of search you would use:
Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();
There might be some slight difference in the semantics but it should be close enough for this purpose.