What's the proof of correctness for Robert Floyd's algorithm for selecting a single, random combination of values?

I read about it in a SO answer:

Algorithm to select a single, random combination of values?

initialize set S to empty
for J := N-M+1 to N do
    T := RandInt(1, J)
    if T is not in S then
        insert T in S
    else
        insert J in S
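
For concreteness, here is a direct Python transcription of the pseudocode (a sketch of my own; I'm assuming RandInt(1, J) is uniform and inclusive on both ends, which matches Python's random.randint):

    import random

    def floyd_sample(n, m):
        """Floyd's algorithm: return a random m-subset of {1, ..., n}."""
        s = set()
        for j in range(n - m + 1, n + 1):  # J := N-M+1 to N
            t = random.randint(1, j)       # T := RandInt(1, J)
            if t not in s:
                s.add(t)                   # insert T in S
            else:
                s.add(j)                   # insert J in S
        return s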

What is the proof of correctness for this? I've spent quite a bit of time on it and can prove to myself that the $N^\text{th}$ value and the $1^\text{st}$ through $(N-M+1)^\text{th}$ values each have $P(\text{chosen}) = M/N$, but not the ones that remain.

e.g.
For $N$ choose $M$, each item enumerated from $1$ to $N$, I can prove to myself that using this algorithm, items $1$, $2$, $\ldots$, $N-M$, $N-M+1$, and $N$ each have probability $M/N$ of being chosen. I don't know how to show this for the remaining items, though.

e.g. 2
For $8$ choose $4$, each item enumerated from $1$ to $8$, I can prove to myself that using this algorithm, items $1$, $2$, $3$, $4$, $5$, and $8$ each have probability $4/8$ of being chosen.
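
For what it's worth, I can check the remaining items (here $6$ and $7$) numerically by enumerating every possible sequence of RandInt outcomes and summing exact probabilities (a brute-force sketch; exact_marginals is my own name):

    from fractions import Fraction
    from itertools import product

    def exact_marginals(n, m):
        """Exact P(chosen) for each item, by enumerating all RandInt outcomes."""
        js = range(n - m + 1, n + 1)
        weight = Fraction(1)
        for j in js:
            weight /= j                  # every outcome sequence is equally likely
        prob = {i: Fraction(0) for i in range(1, n + 1)}
        for ts in product(*(range(1, j + 1) for j in js)):
            s = set()
            for j, t in zip(js, ts):
                s.add(t if t not in s else j)
            for i in s:
                prob[i] += weight
        return prob

    print(exact_marginals(8, 4))         # every item comes out to Fraction(1, 2)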


Consider a fixed number $k\in [n]$ and denote by $T_j$ $\,(n-m\leq j\leq n)$ the time after the value $j$ has been processed; whence $T_{n-m}$ is the starting time. We argue about the probability $P_j$ that $k$ is not yet chosen at time $T_j$.

When $k\leq n-m$ then $P_{n-m}=1$, and $$P_j=\biggl(1-{1\over j}\biggr)P_{j-1}\qquad(n-m+1\leq j\leq n)\ ,$$ since at step $j$ the number $k$ becomes chosen only when $T=k$, which happens with probability $1/j$ (a hit on an already chosen number inserts $j\neq k$ instead). It follows that $$P_n=\prod_{j=n-m+1}^n\biggl({j-1\over j}\biggr)={n-m\over n}\ ,$$ as required.

When $k>n-m$ then $P_{k-1}=1$, because up to time $T_{k-1}$ only numbers $\leq k-1$ have been selected, in fact exactly $k-1-(n-m)$ of them. Therefore, when processing the value $j:=k$ there are $n-m$ unchosen numbers amongst those $\leq k-1$, and if one of these is chosen in step $j:=k$, the number $k$ is not yet chosen at time $T_k$. This implies that we have $$P_k={n-m\over k} P_{k-1}={n-m\over k}\ .$$ For the remaining steps we get as before $$P_n={n-m\over k}\prod_{j=k+1}^n\biggl({j-1\over j}\biggr)={n-m\over n}\ .$$
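
To tie this to the $8$ choose $4$ example from the question, take $n=8$, $m=4$, and $k=6$ (one of the items not covered there): then $P_5=1$, and $$P_6={4\over6}\ ,\qquad P_7={4\over6}\cdot{6\over7}={4\over7}\ ,\qquad P_8={4\over7}\cdot{7\over8}={4\over8}\ ,$$ so item $6$ is chosen with probability $1-P_8=4/8$, matching the other items.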


This can be proved by induction. The base cases are those with $M=1$, for which the algorithm obviously works: the single iteration has $J=N$ and picks a uniformly random element.

As you noted, $N$ has the right probability $M/N$ of being chosen. Everything that happens before the last step, the only one in which $N$ might be chosen, would happen just the same for $N-1$. Thus, if $N$ is chosen, by the induction hypothesis the rest of the combination is a uniformly random choice of $M-1$ out of $N-1$. If $N$ isn't chosen, by the induction hypothesis the rest of the combination is a uniformly random choice of $M-1$ out of $N-1$ with one more uniformly random unused element added, which turns it into a uniformly random choice of $M$ out of $N-1$. (Note that $N$ is chosen in the last step with probability $M/N$ regardless of the contents of $S$, so conditioning on that event does not bias the earlier choices.) This, together with $N$ having the right probability of being chosen, implies that the overall combination is uniformly randomly chosen.
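
This inductive structure can be made explicit by writing the algorithm recursively, with the recursive call playing the role of "everything that happens before the last step" (a sketch; the name floyd_recursive and the $M=0$ base case are my own):

    import random

    def floyd_recursive(n, m):
        """An m-subset of {1, ..., n}; unrolls to exactly Floyd's loop."""
        if m == 0:
            return set()
        s = floyd_recursive(n - 1, m - 1)  # uniform (m-1)-subset of {1, ..., n-1}
        t = random.randint(1, n)           # the last iteration, with J = N
        s.add(t if t not in s else n)
        return s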


Base case

For $_{N}C_{M}$ where $M = 1$, T := RandInt(1, J) executes exactly once, with $J = N-M+1 = N$, randomly selecting one item $I$ of the $N$ items, as is expected for $_{N}C_{1}$. In other words, the trivial base case is $P_{chosen}(I_{i\in1..N} \mid {}_NC_1) = {1 \over N}$, which I'll denote using

$P_{N,1}(I_i) = {1 \over N}$

Inductive hypothesis

$P_{N,M}(I_i) = {M \over N}$ for choosing $M$ items from $N$

Inductive step

The inductive step is to solve for

$P_{N',M'}(I_i)$ when $N'=N+1$ and $M'=M+1$

The inductive step corresponds to one additional loop iteration: running the algorithm for $_{N'}C_{M'}$ is exactly running it for $_{N}C_{M}$ followed by one more iteration with $J = N' = N+1$.

For items $I_{i\in1..N}$

$P_{N',M'}(I_{i\in1..N}) = P_{N,M}(I_i) + \neg P_{N,M}(I_i) \cdot {1 \over N'}$

$P_{N', M'}(I_{i\in1..N})$ = "The chance of item $I_i$, excluding $I_{N'}$, to be chosen after the additional iteration"

$P_{N,M}(I_i)$ = "The chance of $I_i$ to have already been chosen earlier"

$\neg P_{N,M}(I_i)$ = "The chance of $I_i$ to have not yet been chosen earlier"

$1 \over N'$ = "The chance of $I_i$ to be chosen in the additional iteration by RandInt()"

Substitute in the hypothesis and simplify:

$P_{N',M'}(I_{i\in1..N}) = {M \over N} + \Bigl(1-{M \over N}\Bigr) \cdot {1 \over N+1} = {M(N+1) + (N-M) \over N(N+1)} = {N(M+1) \over N(N+1)}$

$P_{N',M'}(I_{i\in1..N}) = {M+1 \over N+1}$

For the last item $I_{N'}$ added in the additional iteration

$P_{N',M'}(I_{i=N+1}) = {1 \over N'} + {M \over N'} = {M+1 \over N+1}$

${M \over N'}$ = "The chance that RandInt() lands on one of the $M$ items already chosen earlier, in which case the algorithm inserts $I_{N'}$"

Now we know for all items $I_i$ in the additional iteration

$P_{N',M'}(I_i) = {M+1 \over N+1} = {M' \over N'}$

which completes the inductive step.


This proof works because for any $N'$ and $M'$, the inductive step reduces the problem to the sub-problem, $N=N'-1$ and $M=M'-1$. Eventually, $M$ will be reduced to $1$, giving us the base case, $_NC_1$.

Since the algorithm guarantees both

  1. $P(I)={M \over N}$ for all items $I$

  2. Exactly $M$ of $N$ items will be selected

we know the algorithm correctly selects a random combination of $M$ items from a set of $N$ items.
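
As an empirical sanity check (not a proof), one can tally the frequency of every $4$-subset of $\{1,\ldots,8\}$ over many runs and compare against the uniform value $1/\binom{8}{4}=1/70\approx0.0143$; a sketch reusing the floyd_sample transcription from the question above:

    from collections import Counter

    trials = 700_000
    counts = Counter(frozenset(floyd_sample(8, 4)) for _ in range(trials))

    print(len(counts))                    # expect 70 distinct subsets
    print(min(counts.values()) / trials)  # expect roughly 1/70 ~ 0.0143
    print(max(counts.values()) / trials)  # expect roughly 1/70 ~ 0.0143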