Find longest repetitive sequence in a string

Solution 1:

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.

Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.

There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.

Hope this helps!

Solution 2:

Use defaultdict to tally each substring beginning with each position in the input string. The OP wasn't clear whether overlapping matches should or shouldn't be included, this brute force method includes them.

from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

prints:

('helloworld', 3)