Analysis of a calculation of expected number of collisions in hashing

For a formal problem statement, I quote from the text Introduction to Algorithms by Cormen et. al

Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing, what is the expected number of collisions? More precisely, what is the expected cardinality of $\{\{k, l\} : k\neq l \text{ and } h(k)= h(l)\}$ ?


Quite intuitively I proceeded as follows:

Let $X$ be the random variable indicating the number of collisions. Let us define an indicator random variable $X_{ij}$ which indicates whether the $i$th element ($e_i$) already in the table collides with the $j$ th element ($e_j$) when the $j$ th element is to be inserted.

$Pr\{h(e_i)=h(e_j)\}=\frac{1}{m}$. So $E[X_{ij}]=\frac{1}{m}$.

So based on that we have:

$$X= \sum_{i=1}^n \sum_{j=i+1}^n X_{ij}$$

Taking expectation on both sides we have: $$E[X]= E\left[\sum_{i=1}^n \sum_{j=i+1}^n X_{ij}\right]$$

Using linearity of expectation:

$$E[X]= \sum_{i=1}^n \sum_{j=i+1}^n E[X_{ij}]$$

But, $E[X_{ij}]=\frac{1}{m}$ , which is already found above.

So,

$$E[X]= \sum_{i=1}^n \sum_{j=i+1}^n \frac{1}{m}$$

$$=\frac{n(n-1)}{2m}$$


Below is how my peer approached:

Let $X$ be the random variable indicating the number of collisions. Let $X_i$ be the indicator random variable indicating collision during the $i$ th insertion of the element, $i=1,2,3,..,n$

So, $$X=\sum_{i=1}^n X_i$$

taking expectation on both sides, we have:

$$E[X]=E\left[\sum_{i=1}^n X_i\right]$$

Using linearity of expectation:

$$E[X]=\sum_{i=1}^n E[X_i]$$

Now let us find $Pr\{X_i=1\}$ then by the property of indicator random variable we shall have $E[X_i]=Pr\{X_i=1\}$.

Probability that there is collision during the first insertion =$0$ [First element is inserted without any collision.]

Probability that there is collision during the second insertion= $\frac{1}{m}$ [Assuming open addressing, $1$ slot is already occupied.]

Probability that there is collision during the third insertion= $\frac{2}{m}$ [Assuming open addressing, $2$ slots are already occupied.]

.

.

.

Probability that there is collision during the $i$th insertion= $\frac{i-1}{m}$ [Assuming open addressing, $i-1$ slots are already occupied.]

So,$$E[X]=\sum_{i=1}^n \frac{i-1}{m}=\frac{n(n-1)}{2m}$$


Though the final answer obtained by my peer is same as that of mine, but I guess there are quite a lot of issues with the approach. First it assumes open addressing, while my approach is a generalized one. Secondly, as per the formal problem statement given in the text, in the worst case if all elements hash to the same slot then we shall have $X=\binom{n}{2}$. So the spectrum of $X$ is , $X=0,1,2,..,\binom{n}{2}$

But the spectrum of $X$ as per my peer's method is $X=0,1,2..,n$. Is my peer's method actually correct?


Solution 1:

Your peer's method is incorrect. The probability the $i$th key collides with the previous keys is not $(i - 1) / m$. Furthermore, he fails to take into account that adding colliding with a pair of keys should add two collisions, not just one. By a happy coincidence, however, these errors cancel each other out.