"Nothing to repeat" from Python regex

Here is a regex, tried first with egrep and then with Python 2.7:

$ echo '/some/path/to/file/abcde.csv' | egrep '*([a-zA-Z]+).csv'

/some/path/to/file/abcde.csv

However, the same regex in Python:

re.match(r'*([a-zA-Z]+)\.csv',f )

Gives:

Traceback (most recent call last):
  File "/shared/OpenChai/bin/plothost.py", line 26, in <module>
    hosts = [re.match(r'*([a-zA-Z]+)\.csv',f ).group(1) for f in infiles]
  File "/usr/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

Doing a search reveals there appears to be a Python bug in play here:

regex error - nothing to repeat

It seems to be a python bug (that works perfectly in vim). The source of the problem is the (\s*...)+ bit.

However, it is not clear to me what the workaround is for my regex shown above, to make Python happy.

Thanks.


You do not need the * in the pattern. It causes the issue because you are trying to quantify the beginning of the pattern, where there is nothing (an empty string) to quantify.

The same "Nothing to repeat" error occurs when you

  • Place any quantifier (+, ?, *, {2}, {4,5}, etc.) at the start of the pattern (e.g. re.compile(r'?'))
  • Add any quantifier right after ^ / \A start of string anchor (e.g. re.compile(r'^*'))
  • Add any quantifier right after $ / \Z end of string anchor (e.g. re.compile(r'$*'))
  • Add any quantifier after a word boundary (e.g. re.compile(r'\b*\d{5}'))

Note, however, that in Python re you may quantify any lookaround, e.g. (?<!\d)*abc and (?<=\d)?abc both compile and yield the same matches as plain abc, since the quantifier makes the lookaround optional.
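
The cases in the list above can be checked with a short script. This is only a sketch: the helper name compile_error and the patterns list are made up for illustration, and the exact wording of the message may differ between Python versions (newer ones append a position).

```python
import re

# Patterns taken from the list above, plus the one from the question.
patterns = [
    r'*([a-zA-Z]+)\.csv',  # quantifier at the very start (the question's regex)
    r'?',                  # quantifier at the very start
    r'^*',                 # quantifier right after the ^ anchor
    r'$*',                 # quantifier right after the $ anchor
    r'\b*\d{5}',           # quantifier right after a word boundary
]


def compile_error(pattern):
    """Return the error message if the pattern fails to compile, else None.

    Hypothetical helper, not part of the re module.
    """
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


for p in patterns:
    print('%r -> %s' % (p, compile_error(p)))
```

Each pattern should report a "nothing to repeat" message, while the same call on ([a-zA-Z]+)\.csv returns None.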

Use

([a-zA-Z]+)\.csv

Or to match the whole string:

.*([a-zA-Z]+)\.csv

The reason is that * is unescaped and is thus treated as a quantifier. A quantifier applies to the preceding subpattern in the regex. Here, it sits at the beginning of the pattern, so it has nothing to quantify, and the "nothing to repeat" error is raised.

If it "works" in Vim, that is only because the Vim regex engine ignores this subpattern (much as Java does with unescaped [ and ] inside a character class like [([)]]).
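
As a quick check of the two patterns suggested above against the path from the question, here is a minimal sketch (the variable names are made up for illustration). Two things are worth noticing: re.match only matches at the start of the string, so the bare pattern needs re.search, and a greedy leading .* leaves just one letter for the capture group, so a lazy .*? is probably what you want if you need group(1).

```python
import re

path = '/some/path/to/file/abcde.csv'

# re.match anchors at the start of the string; the path starts with '/',
# not a letter, so the bare pattern does not match here.
bare_match = re.match(r'([a-zA-Z]+)\.csv', path)
print(bare_match)  # None

# re.search scans for the first position where the pattern matches.
searched = re.search(r'([a-zA-Z]+)\.csv', path).group(1)
print(searched)  # abcde

# A greedy .* grabs as much as it can, leaving one letter for the group.
greedy = re.match(r'.*([a-zA-Z]+)\.csv', path).group(1)
print(greedy)  # e

# A lazy .*? grabs as little as it can, so the group gets the full name.
lazy = re.match(r'.*?([a-zA-Z]+)\.csv', path).group(1)
print(lazy)  # abcde
```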


It's not a bug. Python's regex engine uses a traditional NFA for matching patterns, and the * character only works when it is preceded by a token. From the documentation:

'*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
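
The ab* example from the quote can be tried directly. A small sketch (the candidates list is made up; the trailing $ is added here only so that the whole string has to match):

```python
import re

candidates = ['a', 'ab', 'abbb', 'b', 'abc']

# Keep the strings that consist of 'a' followed by zero or more 'b's.
matches = [s for s in candidates if re.match(r'ab*$', s)]
print(matches)  # ['a', 'ab', 'abbb']
```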

So instead you can use .*, which repeats any character (.):

r'.*([a-zA-Z]+)\.csv'

Python also provides the fnmatch module, which supports Unix shell-style wildcards.

>>> import fnmatch
>>> s="/some/path/to/file/abcde.csv"
>>> fnmatch.fnmatch(s, '*.csv')
True
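
Note that fnmatch only tells you whether a name matches; it does not capture anything. If the goal is the hosts list from the traceback in the question, one possible sketch is to filter with fnmatch.filter and take the base name with os.path. The infiles list here is hypothetical sample data, and this is just one way to do it:

```python
import fnmatch
import os.path

# Hypothetical input; the question's real infiles list is not shown.
infiles = ['/some/path/to/file/abcde.csv', '/some/path/to/file/notes.txt']

# Keep only the names that match the shell-style pattern.
csv_files = fnmatch.filter(infiles, '*.csv')

# Strip the directory and the extension to get the bare name.
hosts = [os.path.splitext(os.path.basename(f))[0] for f in csv_files]
print(hosts)  # ['abcde']
```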