How to get all overlapping matches in python regex that may start at the same location in a string?

How do I get all possible overlapping matches in a string in Python with multiple starting and ending points.

I've tried using regex module, instead of default re module to introduce overlapped = True argument, but still it is missing some matches.

Trying to describe my problem via a simpler illustration:

Find all possible combinations in the string (axaybzb) starting with a and ending with b

Tried following codes:

import regex

print(regex.findall(r'a\w+b','axaybzb', overlapped=False))

['axaybzb']

print(regex.findall(r'a\w+?b','axaybzb', overlapped=False))

['axayb']

print(regex.findall(r'a\w+b','axaybzb', overlapped=True))

['axaybzb', 'aybzb']

print(regex.findall(r'a\w+?b','axaybzb', overlapped=True))

['axayb', 'ayb']

Expected output to be

['axayb', 'axaybzb', 'ayb', 'aybzb']

Solution 1:

Regex are not the proper tool here, I would recommend:

Identify all the indexes of the first letter in the input string
Identify all the indexes of the second letter in the input string
Build all the substrings based on those indexes

code:

def find(str, ch):
    for i, ltr in enumerate(str):
        if ltr == ch:
            yield i

s = "axaybzb"
startChar = 'a'
endChar = 'b'

startCharList = list(find(s,startChar))
endCharList = list(find(s,endChar))

output = []
for u in startCharList:
    for v in endCharList:
           if u <= v:
               output.append(s[u:v+1])
print(output)

output:

$ python substring.py 
['axayb', 'axaybzb', 'ayb', 'aybzb']

Solution 2:

With simple patterns like yours, you may generate slices of all consecutive chars in a string and test them all against a specific regex for a full match:

import re

def findall_overlapped(r, s):
  res = []                     # Resulting list
  reg = r'^{}$'.format(r)      # Regex must match full string
  for q in range(len(s)):      # Iterate over all chars in a string
    for w in range(q,len(s)):  # Iterate over the rest of the chars to the right
        cur = s[q:w+1]         # Currently tested slice
        if re.match(reg, cur): # If there is a full slice match
            res.append(cur)    # Append it to the resulting list
  return res

rex = r'a\w+b'
print(findall_overlapped(rex, 'axaybzb'))
# => ['axayb', 'axaybzb', 'ayb', 'aybzb']

See the Python demo

WARNING: Note this won't work if you have a pattern checking left- or right-hand contexts, with lookaheads or lookbehinds on either end of the pattern since this context will be lost when iterating over the string.

How to get all overlapping matches in python regex that may start at the same location in a string?

Solution 1:

Solution 2:

Related

Recent Posts