Coding the mathematical approach for finding index of a permutation that has repetition
In this answer I will explain how to get to a function that will enable you to get the index of an element in the sequence as follows
print("Item 3749832414 is at (0-based) index %d" %
item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
print("Item 7364512193 is at (0-based) index %d" %
item_index('7364512193', alphabet='0123456789', length=10, distinct=8))
> Item 3749832414 is at (0-based) index 508309342
> Item 7364512193 is at (0-based) index 1005336982
Enumeration method
By the nature of your problem it is interesting to solve it in a recursive manner, adding digits one by one and keeping track of the number of digits used. Python provide iterators so that you can produce items one by one without storing the whole sequence.
Basically all the items can be arranged in a prefix tree, and we walk the three yielding the leaf nodes.
def iter_seq(alphabet, length, distinct, prefix=''):
if distinct < 0:
# the prefix used more than the allowed number of distinct digits
return
if length == 0:
# if distinct > 0 it means that prefix did not use
# enought distinct digits
if distinct == 0:
yield prefix
else:
for d in alphabet:
if d in prefix:
# the number of distinct digits in prefix + d is the same
# as in prefix.
yield from iter_seq(alphabet, length-1, distinct, prefix + d)
else:
# the number of distinct digits in prefix + d is one more
# than the distinct digits in prefix.
yield from iter_seq(alphabet, length-1, distinct-1, prefix + d)
Let's test it with examples that can be visualized
list(iter_seq('0123', 5, 1))
['00000', '11111', '22222', '33333']
import numpy as np
np.reshape(list(iter_seq('0123', 4, 2)), (12, 7))
array([['0001', '0002', '0003', '0010', '0011', '0020', '0022'],
['0030', '0033', '0100', '0101', '0110', '0111', '0200'],
['0202', '0220', '0222', '0300', '0303', '0330', '0333'],
['1000', '1001', '1010', '1011', '1100', '1101', '1110'],
['1112', '1113', '1121', '1122', '1131', '1133', '1211'],
['1212', '1221', '1222', '1311', '1313', '1331', '1333'],
['2000', '2002', '2020', '2022', '2111', '2112', '2121'],
['2122', '2200', '2202', '2211', '2212', '2220', '2221'],
['2223', '2232', '2233', '2322', '2323', '2332', '2333'],
['3000', '3003', '3030', '3033', '3111', '3113', '3131'],
['3133', '3222', '3223', '3232', '3233', '3300', '3303'],
['3311', '3313', '3322', '3323', '3330', '3331', '3332']],
dtype='<U4')
Counting items
As you noticed by your previous question, the number of items in a sequence only depends on the length of each string, the size of the alphabet, and the number of distinct symbols.
If we look to the loop of the above function, we only have two cases, (1) the current digit is in the prefix, (2) the digit is not in the prefix. The number of times the digit will be in the prefix is exactly the number of distinct digits in the prefix. So we can add an argument used
to keep track of the number of digits already used instead of the actual prefix. Now the complexity goes from O(length!)
to O(2**length)
.
Additionally we use a lru_cache
decorator that will memorize the values and return them without calling the function if the arguments are repetaed, this makes the function to run in O(length**2)
time and space.
from functools import lru_cache
@lru_cache
def count_seq(n_symbols, length, distinct, used=0):
if distinct < 0:
return 0
if length == 0:
return 1 if distinct == 0 else 0
else:
return \
count_seq(n_symbols, length-1, distinct-0, used+0) * used + \
count_seq(n_symbols, length-1, distinct-1, used+1) * (n_symbols - used)
We can that it is consistent with iter_seq
assert(sum(1 for _ in iter_seq('0123', 4, 2)) == count_seq(4, 4, 2))
We can also test that it aggrees with the example you calculated by hand
assert(count_seq(10, 10, 8) == 1360800000)
Item at index
This part is not necessary to get the final answer but it is a good exercise. Furthermore it will give us a way to compute larger sequences that would be tedious by hand.
This could be achieved by iterating iter_seq
the given number of times. This function does that more efficiently by comparing the number of leaves in a given subtree (number of items produced by a specific call) with the distance to the requested index. If the requested index is distant more than the number of items produced by a call it means we can skip that call at all, and jump directly to the next sibling in the tree.
def item_at(idx, alphabet, length, distinct, used=0, prefix=''):
if distinct < 0:
return
if length == 0:
return prefix
else:
for d in alphabet:
if d in prefix:
branch_count = count_seq(len(alphabet),
length-1, distinct, used)
if branch_count <= idx:
idx -= branch_count
else:
return item_at(idx, alphabet,
length-1, distinct, used, prefix + d)
else:
branch_count = count_seq(len(alphabet),
length-1, distinct-1, used+1)
if branch_count <= idx:
idx -= branch_count
else:
return item_at(idx, alphabet,
length-1, distinct-1, used+1, prefix + d)
We can test that it is consistent with iter_seq
for i, w in enumerate(iter_seq('0123', 4, 2)):
assert w == item_at(i, '0123', 4, 2)
Index of a given item
Remembering that we are walking in a prefix tree, given a string we can walk directly to the desired node. The way to find the index is to sum the size of all the subtrees that are left behind on this path.
def item_index(item, alphabet, length, distinct, used=0, prefix=''):
if distinct < 0:
return 0
if length == 0:
return 0
else:
offset = 0
for d in alphabet:
if d in prefix:
if d == item[0]:
return offset + item_index(item[1:], alphabet,
length-1, distinct, used, prefix + d)
else:
offset += count_seq(len(alphabet),
length-1, distinct, used)
else:
if d == item[0]:
return offset + item_index(item[1:], alphabet,
length-1, distinct-1, used+1, prefix + d)
else:
offset += count_seq(len(alphabet),
length-1, distinct-1, used+1)
And again we can test the consistency between this and iter_seq
for i,w in enumerate(iter_seq('0123', 4, 2)):
assert i == item_index(w, '0123', 4, 2)
Or to query for the example numbers you gave as I promised in the beginning of the post
print("Item 3749832414 is at (0-based) index %d" %
item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
print("Item 7364512193 is at (0-based) index %d" %
item_index('7364512193', alphabet='0123456789', length=10, distinct=8))
> Item 3749832414 is at (0-based) index 508309342
> Item 7364512193 is at (0-based) index 1005336982
Bonus: Larger sequences
Let's calculate the index of UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0
in the sequences of length 64 and 50 distinct symbols
item_index('UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0',
alphabet='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
distinct=50, length=64)
Surprisingly it is 10000...000 = 10**110
. How could I find that particular string??
If we choose 3 symbols from the set {a, b, c, d, e, f}, there are 20 possible combinations. We can record these combinations in a integer such as:
{a, b, c} => 1
{a, b, d} => 2
{a, b, e} => 3
...
{d, e, f} => 20
Then after we finish choosing 3 symbols from the set, there will have 3^6 possible permutations. Then we can represent it in 12 bits. Take {a, b, c} for example, the representation can be:
aaaaaa => 00 00 00 00 00 00
aaaaab => 00 00 00 00 00 01
aaaaac => 00 00 00 00 00 10
...
cccccb => 10 10 10 10 10 01
cccccc => 10 10 10 10 10 10
Then you can use the combination of one integer and 12 bits binaries to index your permutation.