Python regular expression pattern * is not working as expected

While working through Google's 2010 Python class, I found the following documentation:

'*' -- 0 or more occurrences of the pattern to its left

But when I tried the following

re.search(r'i*','biiiiiiiiiiiiiig').group() 

I expected 'iiiiiiiiiiiiii' as output but got ''. Why?


Solution 1:

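* means zero or more occurrences, and re.search returns only the first match it finds while scanning the string from left to right. At the very start of the string, before the b, i* can match zero i's, so the first match is the empty string. That is why you get an empty string as output.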

Change * to + to get the desired output.

>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i+','biiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
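One thing to keep in mind when switching to +: unlike i*, the pattern i+ can fail to match at all, and then re.search returns None, so calling .group() directly raises an AttributeError. Here is a minimal sketch of a guard. The helper name first_run is my own invention for illustration, not something from the re module.

```python
import re

def first_run(pattern, text):
    """Return the first match as a string, or None when nothing matches.

    Hypothetical helper for illustration; not part of the re module.
    """
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_run(r'i+', 'big'))        # i+ finds the single i
print(first_run(r'i+', 'bg'))         # no i at all, so the result is None
print(repr(first_run(r'i*', 'bg')))   # i* still succeeds with the empty string
```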

Consider this example.

>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i*','iiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'

In the second example, i* returns iiiiiiiiiiiiii because the string starts with i: the regex engine tries to match zero or more i's at the first position, finds an i there, and greedily consumes all the i's that follow. When the string does not start with i (as in biiiiiiiiiiiiiig), i* cannot consume the b, but it can still match zero i's, so it matches the empty string just before the b. In general, i* matches an empty string before every character it cannot consume, which here means before b and before g. Because re.search returns only the first match, you get the empty string found before the b.
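If you want to see where each match was found, Match.span() returns its start and end indices. A small sketch, building the string from the question programmatically (I am assuming 14 i's, as in the output above) so the indices are easy to check:

```python
import re

# Same shape as the string in the question: b, a run of i's, then g.
text = 'b' + 'i' * 14 + 'g'

empty = re.search(r'i*', text)
print(empty.span())   # (0, 0): a zero-width match sitting before the b

run = re.search(r'i+', text)
print(run.span())     # (1, 15): the run of i's between b and g
```

A span whose start and end are equal is a zero-width match, which is exactly what .group() shows as ''.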

Why did I get three empty strings as output in the example below?

>>> re.findall(r'i*','biiiiiiiiiiiiiig')
['', 'iiiiiiiiiiiiii', '', '']

As I explained earlier, you get an empty string as a match before every character the pattern cannot consume. The regex engine scans the input from left to right:

  1. The first item is an empty string because the pattern i* cannot match the character b, but it does match the empty string just before the b.

  2. The engine then moves to the next character, i, which our pattern i* matches, so it greedily consumes all the following i's. That gives iiiiiiiiiiiiii as the second item.

  3. After matching all the i's, the engine reaches the next character, g, which i* cannot match. So i* matches the empty string before the g. That is the third item (the second empty string).

  4. Finally, i* matches the empty string at the very end of the input, after the g. That is the fourth item (the third empty string).
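The four steps above can be checked with re.finditer, which yields match objects instead of plain strings, so each match can be printed together with its position. This is only a sketch; the string is built programmatically, assuming 14 i's as in the question.

```python
import re

# Same shape as the string in the question: b, a run of i's, then g.
text = 'b' + 'i' * 14 + 'g'

# Collect (span, matched text) for every match of i*, in left-to-right order.
matches = [(m.span(), m.group()) for m in re.finditer(r'i*', text)]

for span, value in matches:
    print(span, repr(value))

# Keep only the non-empty matches, if the empty strings are not wanted.
runs = [value for _, value in matches if value]
print(runs)
```

The zero-width matches show up at index 0 (before b), index 15 (before g) and index 16 (end of the string), matching steps 1, 3 and 4.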