Algorithm puzzle interview

"Subsequence" usually means noncontiguous. I'm going to assume that you meant "sublist".

Here's an O(N P) algorithm assuming we can hash (assumption not needed; we can radix sort instead). Scan the array keeping a running total for each number. For your example,

  1  2  3
 --------
  0  0  0
1 
  1  0  0
2
  1  1  0
1
  2  1  0
3
  2  1  1
2
  2  2  1
1
  3  2  1
3
  3  2  2
1
  4  2  2
2
  4  3  2
3
  4  3  3
1
  5  3  3

Now, normalize each row by subtracting the minimum element. The result is

 0: 000
 1: 100
 2: 110
 3: 210
 4: 100
 5: 110
 6: 210
 7: 100
 8: 200
 9: 210
10: 100
11: 200.

Prepare two hashes, mapping each row to the first index at which it appears and the last index at which it appears. Iterate through the keys and take the one with maximum last - first.

000: first is at 0, last is at 0
100: first is at 1, last is at 10
110: first is at 2, last is at 5
210: first is at 3, last is at 9
200: first is at 8, last is at 11

The best key is 100, since its sublist has length 9. The sublist is the (1+1)th element to the 10th.

This works because a sublist is balanced if and only if its first and last unnormalized histograms are the same up to adding a constant, which occurs if and only if the first and last normalized histograms are identical.


If the memory usage is not important, it's easy...

You can give the matrix dimensions N*p and save in column (i) the value corresponding to how many elements p is looking between (i) first element in the second vector...

After completing the matrix, you can search for column i that all of the elements in column i are not different. The maximum i is the answer.


With randomization, you can get it down to linear time. The idea is to replace each of the P values with a random integer, such that those integers sum to zero. Now look for two prefix sums that are equal. This allows some small chance of false positives, which we could remedy by checking our output.

In Python 2.7:

# input:
vec1 = [1, 2, 3]
P = len(vec1)
vec2 = [1, 2, 1, 3, 2, 1, 3, 1, 2, 3, 1]
N = len(vec2)
# Choose big enough integer B.  For each k in vec1, choose
# a random mod-B remainder r[k], so their mod-B sum is 0.
# Any P-1 of these remainders are independent.
import random
B = N*N*N
r = dict((k, random.randint(0,B-1)) for k in vec1)
s = sum(r.values())%B
r[vec1[0]] = (r[vec1[0]]+B-s)%B
assert sum(r.values())%B == 0
# For 0<=i<=N, let vec3[i] be mod-B sum of r[vec2[j]], for j<i.
vec3 = [0] * (N+1)
for i in range(1,N+1):
    vec3[i] = (vec3[i-1] + r[vec2[i-1]]) % B
# Find pair (i,j) so vec3[i]==vec3[j], and j-i is as large as possible.
# This is either a solution (subsequence vec2[i:j] is uniform) or a false
# positive.  The expected number of false positives is < N*N/(2*B) < 1/N.
(i, j)=(0, 0)
first = {}
for k in range(N+1):
    v = vec3[k]
    if v in first:
        if k-first[v] > j-i:
            (i, j) = (first[v], k)
    else:
        first[v] = k
# output:
print "Found subsequence from", i, "(inclusive) to", j, "(exclusive):"
print vec2[i:j]
print "This is either uniform, or rarely, it is a false positive."

Here is an observation: you can't get a uniformly distributed sequence that is not a multiplication of P in length. This implies that you only have to check the sub-sequences of N that are P, 2P, 3P... long - (N/P)^2 such sequences.