"Nothing to repeat" from Python regex
Here is a regex, tried first with egrep and then with Python 2.7:
$ echo '/some/path/to/file/abcde.csv' | egrep '*([a-zA-Z]+).csv'
/some/path/to/file/abcde.csv
However, the same regex in Python:
re.match(r'*([a-zA-Z]+)\.csv',f )
Gives:
Traceback (most recent call last):
  File "/shared/OpenChai/bin/plothost.py", line 26, in <module>
    hosts = [re.match(r'*([a-zA-Z]+)\.csv',f ).group(1) for f in infiles]
  File "/usr/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat
A search suggests that a Python bug may be in play here:
regex error - nothing to repeat
It seems to be a python bug (that works perfectly in vim). The source of the problem is the (\s*...)+ bit.
However, it is not clear to me what the workaround is for my regex shown above, to make Python happy.
Thanks.
You do not need the * in the pattern. It causes the issue because you are trying to quantify the beginning of the pattern, but there is nothing there, only an empty string, to quantify.
The same "nothing to repeat" error occurs when you
- Place any quantifier (+, ?, *, {2}, {4,5}, etc.) at the start of the pattern (e.g. re.compile(r'?'))
- Add any quantifier right after the ^ / \A start-of-string anchor (e.g. re.compile(r'^*'))
- Add any quantifier right after the $ / \Z end-of-string anchor (e.g. re.compile(r'$*'))
- Add any quantifier after a word boundary (e.g. re.compile(r'\b*\d{5}'))
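A quick way to check the cases listed above is to compile each pattern and catch the error. This is only a minimal sketch, assuming Python 3, where the exception is re.error and the message also carries a position. The helper name compile_error is made up for illustration.

```python
import re

def compile_error(pattern):
    """Return the error message if the pattern fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Patterns taken from the list above: a quantifier with nothing before it,
# or directly after an anchor or a word boundary.
patterns = [r'?', r'*([a-zA-Z]+)\.csv', r'^*', r'$*', r'\b*\d{5}']

for p in patterns:
    print(repr(p), '->', compile_error(p))
```

Each of these should report a "nothing to repeat" message, while the same pattern without the stray quantifier compiles cleanly.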
Note, however, that in Python re, you may quantify any lookaround, e.g. (?<!\d)*abc and (?<=\d)?abc will yield the same matches since the lookarounds are optional.
Use
([a-zA-Z]+)\.csv
Or to match the whole string:
.*([a-zA-Z]+)\.csv
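For the list comprehension in the question, a sketch along these lines may help (assuming Python 3). Two points here are my own additions rather than part of the answer: re.match only matches at the start of the string, so the first pattern needs re.search, and with the second pattern the greedy .* leaves just the last letter for the group. The lazy .*? variant is one possible way around that.

```python
import re

path = '/some/path/to/file/abcde.csv'

# re.search scans the string, so no leading wildcard is needed.
m = re.search(r'([a-zA-Z]+)\.csv', path)
name_search = m.group(1) if m else None

# re.match anchors at the start; the greedy .* consumes as much as it can,
# so the group keeps only the final letter before ".csv".
m = re.match(r'.*([a-zA-Z]+)\.csv', path)
name_greedy = m.group(1) if m else None

# A lazy .*? (an assumption, not from the answer) leaves the whole run of letters.
m = re.match(r'.*?([a-zA-Z]+)\.csv', path)
name_lazy = m.group(1) if m else None

print(name_search, name_greedy, name_lazy)
```

If the captured name matters, as it does with .group(1) in the question, the re.search form is probably the simplest choice.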
The reason is that * is unescaped and is thus treated as a quantifier. A quantifier applies to the preceding subpattern in the regex. Here, it appears at the very beginning of the pattern, so there is nothing for it to quantify, and the "nothing to repeat" error is raised.
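If a literal asterisk were actually intended, it would have to be escaped. A small sketch, assuming Python 3 and a made-up sample string:

```python
import re

sample = 'backup*.csv'  # hypothetical string containing a real asterisk

# \* matches the character itself instead of acting as a quantifier.
literal = re.search(r'\*\.csv', sample)

# re.escape builds the same kind of pattern from plain text.
escaped = re.search(re.escape('*.csv'), sample)

print(literal.group(0), escaped.group(0))
```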
If it "works" in Vim, it is just because the Vim regex engine ignores this subpattern (same as Java does with unescaped [ and ] inside a character class like [([)]]).
It's not a bug. The Python regex engine uses a traditional NFA for matching patterns, and the * character only works when preceded by a token.
'*'
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
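The documentation excerpt above can be checked directly. A minimal sketch using re.fullmatch (Python 3.4 or later), with test strings chosen here for illustration:

```python
import re

pattern = re.compile(r'ab*')  # 'a' followed by zero or more 'b'

for text in ['a', 'ab', 'abbb', 'b']:
    # fullmatch succeeds only if the whole string fits the pattern
    print(text, bool(pattern.fullmatch(text)))
```

The last string has no leading 'a', so it does not match: the * repeats the 'b' token but cannot stand in for the 'a'.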
So instead you can use .*, which repeats any character (.):
r'.*([a-zA-Z]+)\.csv'
Python also provides the fnmatch module, which supports Unix shell-style wildcards.
>>> import fnmatch
>>> s="/some/path/to/file/abcde.csv"
>>> fnmatch.fnmatch(s, '*.csv')
True
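Note that fnmatch only tests whether a name matches; it does not capture anything. To get the equivalent of group(1) from the question, one option is to filter with fnmatch.filter and then strip the directory and extension with os.path. A sketch assuming Python 3 and a made-up infiles list:

```python
import fnmatch
import os.path

# Hypothetical input list; the real infiles comes from the question's script.
infiles = ['/some/path/to/file/abcde.csv', '/some/path/to/file/notes.txt']

# Keep only the names that match the shell-style wildcard.
csv_files = fnmatch.filter(infiles, '*.csv')

# Drop the directory part and the extension to get the bare name.
hosts = [os.path.splitext(os.path.basename(f))[0] for f in csv_files]

print(hosts)
```

Unlike the [a-zA-Z]+ group in the regex, this keeps whatever the file stem contains, including digits or punctuation, so it is not an exact replacement.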