Python regular expression pattern * is not working as expected
While working through Google's 2010 Python class, I found the following documentation:
'*' -- 0 or more occurrences of the pattern to its left
But when I tried the following:
re.search(r'i*','biiiiiiiiiiiiiig').group()
I expected 'iiiiiiiiiiiiii' as output but got ''. Why?
Solution 1:
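* means 0 or more, but re.search returns only the first match, scanning from the left. At position 0 the character is b, not i, so i* succeeds there by matching zero occurrences. The first match is therefore an empty string, and that is what you get as output.
Change * to + (1 or more) to get the desired output.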
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i+','biiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
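To see where each match actually lands, you can inspect the span of the match object. This is a minimal sketch using a shortened string ('biiig', chosen here for readability, not taken from the question); the variable names are illustrative.

```python
import re

text = 'biiig'  # shortened stand-in for the string in the question

star = re.search(r'i*', text)  # zero or more i's
plus = re.search(r'i+', text)  # one or more i's

# i* succeeds immediately at index 0 with a zero-length match.
print(star.span(), repr(star.group()))  # (0, 0) ''

# i+ cannot match at index 0, so the engine moves on to index 1.
print(plus.span(), repr(plus.group()))  # (1, 4) 'iii'
```

The span (0, 0) shows that the empty match sits in front of the b, which is why the i's are never reached.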
Consider this example.
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i*','iiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
In the second example, i* returns iiiiiiiiiiiiii because the regex engine first tries to match zero or more i's at the very start of the string. That string starts with i, so the pattern greedily matches all the consecutive i's and you get iiiiiiiiiiiiii as output. If the string does not start with i (consider the string biiiiiiig), i* still succeeds at every position where there is no i to consume by matching the empty string at that position. In our case it matches the empty strings that exist before b and before g. Because re.search returns only the first match, you get the empty string found in front of the non-matching b at the start.
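One way to check this is to start the search one character later. A compiled pattern's search method accepts a starting position, so the sketch below (again with the shortened, illustrative string 'biiig') compares a search from index 0 with a search from index 1.

```python
import re

text = 'biiig'  # shortened stand-in for the string in the question
pattern = re.compile(r'i*')

at_start = pattern.search(text)     # begins at index 0, where b is
from_one = pattern.search(text, 1)  # begins at index 1, the first i

print(at_start.span(), repr(at_start.group()))  # (0, 0) ''
print(from_one.span(), repr(from_one.group()))  # (1, 4) 'iii'
```

The pattern is the same in both calls; only the starting position changes which match comes first.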
Why did I get three empty strings as output in the example below?
>>> re.findall(r'i*','biiiiiiiiiiiiiig')
['', 'iiiiiiiiiiiiii', '', '']
As I explained earlier, at every position where i* has no i to consume, it matches an empty string. Let me explain. The regex engine scans the input from left to right.
The first empty string appears because the pattern i* won't match the character b, but it does match the empty string that exists before the b. The engine then moves to the next character, i, which is matched by our pattern i*, so it greedily matches all the following i's. That gives iiiiiiiiiiiiii as the second result. After matching all the i's, the engine moves to the next character, g, which isn't matched by our pattern i*. So i* matches the empty string before the non-matching g. That's the reason for the third result (the second empty string). Finally, our pattern i* matches the empty string that exists at the end of the string. That's the reason for the fourth result (the third empty string).
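The walk-through above can be made visible with re.finditer, which yields match objects so that the position of each match can be printed. This is a small sketch on a shortened, illustrative string ('biiig'); the list name is hypothetical.

```python
import re

text = 'biiig'  # shortened stand-in for the string in the question

# Collect (start, end, matched text) for every match of i*.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r'i*', text)]

for start, end, value in matches:
    print(start, end, repr(value))
# 0 0 ''      empty match before b
# 1 4 'iii'   greedy run of i's
# 4 4 ''      empty match before g
# 5 5 ''      empty match at the end of the string
```

The three zero-length spans correspond to the three empty strings in the re.findall output.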