Why does this take so long to match? Is it a bug?

I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:

r'(\d+(,)?)+/$'

I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')

The expected result would be that there were no matches.

This expression, however, executes almost immediately (note the trailing slash):

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')

Is this a bug?

Solution 1:

There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.

The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.

For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.

This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.

Solution 2:

First off, I must say that it is not a BUG. its because of that , it tries all the possibilities, it takes time and computing resources. Sometimes it can gobble up a lot of time. When it gets really bad, it’s called catastrophic backtracking.

This is the code of findall function in python source code:

 def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!

The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alter native to try, and which to leave for later). Whichever course of action is attempted, if it’s successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).

Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:

At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None. So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!

As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).

Now for better understanding, I measure the runtime of the program while changing the length of the string:

'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py

real    0m3.814s
user    0m3.818s
sys     0m0.000s

'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before
~/Desktop $ time python ex.py

real    0m5.846s
user    0m5.837s
sys     0m0.015s

'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before 
~/Desktop $ time python ex.py

real    0m15.796s
user    0m15.803s
sys     0m0.008s

So to avoid this problem you can use one of the below ways:

Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
Reduction the possibility of backtracking by breaking the nested groups to separate regexes.

Solution 3:

To avoid the catastrophic backtracking I suggest

r'\d+(,\d+)*/$'

Why does this take so long to match? Is it a bug?

Solution 1:

Solution 2:

Solution 3:

Related

Recent Posts