Python re infinite execution

Solution 1:

Your regex runs into catastrophic backtracking because you have nested quantifiers (([...]+)*). Since your regex requires the string to end in / (which fails on your example), the regex engine tries all permutations of the string in the vain hope to find a matching combination. That's where it gets stuck.

To illustrate, let's assume "A*BCD" as the input to your regex and see what happens:

  1. (\w+) matches A. Good.
  2. \* matches *. Yay.
  3. [\w\s]+ matches BCD. OK.
  4. / fails to match (no characters left to match). OK, let's back up one character.
  5. / fails to match D. Hum. Let's back up some more.
  6. [\w\s]+ matches BC, and the repeated [\w\s]+ matches D.
  7. / fails to match. Back up.
  8. / fails to match D. Back up some more.
  9. [\w\s]+ matches B, and the repeated [\w\s]+ matches CD.
  10. / fails to match. Back up again.
  11. / fails to match D. Back up some more, again.
  12. How about [\w\s]+ matches B, repeated [\w\s]+ matches C, repeated [\w\s]+ matches D? No? Let's try something else.
  13. [\w\s]+ matches BC. Let's stop here and see what happens.
  14. Darn, / still doesn't match D.
  15. [\w\s]+ matches B.
  16. Still no luck. / doesn't match C.
  17. Hey, the whole group is optional (...)*.
  18. Nope, / still doesn't match B.
  19. OK, I give up.

Now that was a string of just three letters. Yours had about 30, trying all permutations of which would keep your computer busy until the end of days.

I suppose what you're trying to do is to get the strings before/after *, in which case, use

pattern = r"(\w+)\*([\w\s]+)$"

Solution 2:

Interestingly, Perl runs it very quickly

-> perl -e 'print "Match\n" if "COPRO*HORIZON 2000                 HOR" =~ m|(\w+)\*([\w\s]+)*/$|'
-> perl -e 'print "Match\n" if "COPRO*HORIZON 2000                 HOR/" =~ m|(\w+)\*([\w\s]+)*/$|'
Match

Solution 3:

Try re2 or any other regular expression engine base on automata theory. The one in a current python re module is a simple and slow backtracking engine (for now, things may change in future). But automata based engines have some restriction, it wouldn't allow you to use backreferences for example. Collate with this re2 syntax page to find out will it satisfy your needs or not.

Solution 4:

Looks like it might be something in your pattern. I'm not sure what you are trying to do with the last '*' in your expression. The following code seems to work for me:

import re

pattern = r"(\w+)\*([\w\s]+)$"

re_compiled = re.compile(pattern)

results = re_compiled.search('COPRO*HORIZON 2000                 HOR')

print(results.groups())