How to get all overlapping matches in python regex that may start at the same location in a string?

How do I get all possible overlapping matches in a string in Python with multiple starting and ending points.

I've tried using regex module, instead of default re module to introduce overlapped = True argument, but still it is missing some matches.

Trying to describe my problem via a simpler illustration:

Find all possible combinations in the string (axaybzb) starting with a and ending with b

Tried following codes:

import regex

print(regex.findall(r'a\w+b','axaybzb', overlapped=False))

['axaybzb']

print(regex.findall(r'a\w+?b','axaybzb', overlapped=False))

['axayb']

print(regex.findall(r'a\w+b','axaybzb', overlapped=True))

['axaybzb', 'aybzb']

print(regex.findall(r'a\w+?b','axaybzb', overlapped=True))

['axayb', 'ayb']

Expected output to be

['axayb', 'axaybzb', 'ayb', 'aybzb']

Solution 1:

Regex are not the proper tool here, I would recommend:

  • Identify all the indexes of the first letter in the input string
  • Identify all the indexes of the second letter in the input string
  • Build all the substrings based on those indexes

code:

def find(str, ch):
    for i, ltr in enumerate(str):
        if ltr == ch:
            yield i

s = "axaybzb"
startChar = 'a'
endChar = 'b'

startCharList = list(find(s,startChar))
endCharList = list(find(s,endChar))

output = []
for u in startCharList:
    for v in endCharList:
           if u <= v:
               output.append(s[u:v+1])
print(output)

output:

$ python substring.py 
['axayb', 'axaybzb', 'ayb', 'aybzb']

Solution 2:

With simple patterns like yours, you may generate slices of all consecutive chars in a string and test them all against a specific regex for a full match:

import re

def findall_overlapped(r, s):
  res = []                     # Resulting list
  reg = r'^{}$'.format(r)      # Regex must match full string
  for q in range(len(s)):      # Iterate over all chars in a string
    for w in range(q,len(s)):  # Iterate over the rest of the chars to the right
        cur = s[q:w+1]         # Currently tested slice
        if re.match(reg, cur): # If there is a full slice match
            res.append(cur)    # Append it to the resulting list
  return res

rex = r'a\w+b'
print(findall_overlapped(rex, 'axaybzb'))
# => ['axayb', 'axaybzb', 'ayb', 'aybzb']

See the Python demo

WARNING: Note this won't work if you have a pattern checking left- or right-hand contexts, with lookaheads or lookbehinds on either end of the pattern since this context will be lost when iterating over the string.