Analysis of a calculation of expected number of collisions in hashing
For a formal problem statement, I quote from the text Introduction to Algorithms by Cormen et. al
Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of $\{\{k, l\} : k\neq l \text{ and } h(k)= h(l)\}$ ?
Quite intuitively I proceeded as follows:
Let $X$ be the random variable indicating the number of collisions. Let us define an indicator random variable $X_{ij}$ which indicates whether the $i$th element ($e_i$) already in the table collides with the $j$ th element ($e_j$) when the $j$ th element is to be inserted.
$Pr\{h(e_i)=h(e_j)\}=\frac{1}{m}$. So $E[X_{ij}]=\frac{1}{m}$.
So based on that we have:
$$X= \sum_{i=1}^n \sum_{j=i+1}^n X_{ij}$$
Taking expectation on both sides we have: $$E[X]= E\left[\sum_{i=1}^n \sum_{j=i+1}^n X_{ij}\right]$$
Using linearity of expectation:
$$E[X]= \sum_{i=1}^n \sum_{j=i+1}^n E[X_{ij}]$$
But, $E[X_{ij}]=\frac{1}{m}$ , which is already found above.
So,
$$E[X]= \sum_{i=1}^n \sum_{j=i+1}^n \frac{1}{m}$$
$$=\frac{n(n-1)}{2m}$$
Below is how my peer approached:
Let $X$ be the random variable indicating the number of collisions. Let $X_i$ be the indicator random variable indicating collision during the $i$ th insertion of the element, $i=1,2,3,..,n$
So, $$X=\sum_{i=1}^n X_i$$
taking expectation on both sides, we have:
$$E[X]=E\left[\sum_{i=1}^n X_i\right]$$
Using linearity of expectation:
$$E[X]=\sum_{i=1}^n E[X_i]$$
Now let us find $Pr\{X_i=1\}$ then by the property of indicator random variable we shall have $E[X_i]=Pr\{X_i=1\}$.
Probability that there is collision during the first insertion =$0$ [First element is inserted without any collision.]
Probability that there is collision during the second insertion= $\frac{1}{m}$ [Assuming open addressing, $1$ slot is already occupied.]
Probability that there is collision during the third insertion= $\frac{2}{m}$ [Assuming open addressing, $2$ slots are already occupied.]
.
.
.
Probability that there is collision during the $i$th insertion= $\frac{i-1}{m}$ [Assuming open addressing, $i-1$ slots are already occupied.]
So,$$E[X]=\sum_{i=1}^n \frac{i-1}{m}=\frac{n(n-1)}{2m}$$
Though the final answer obtained by my peer is same as that of mine, but I guess there are quite a lot of issues with the approach. First it assumes open addressing, while my approach is a generalized one. Secondly, as per the formal problem statement given in the text, in the worst case if all elements hash to the same slot then we shall have $X=\binom{n}{2}$. So the spectrum of $X$ is , $X=0,1,2,..,\binom{n}{2}$
But the spectrum of $X$ as per my peer's method is $X=0,1,2..,n$. Is my peer's method actually correct?
Solution 1:
Your peer's method is incorrect. The probability the $i$th key collides with the previous keys is not $(i - 1) / m$. Furthermore, he fails to take into account that adding colliding with a pair of keys should add two collisions, not just one. By a happy coincidence, however, these errors cancel each other out.