Birthday Problem applied to collisions

I'm trying to extend the birthday problem to estimate the collision probability in a hashing scheme. Here is my problem. I use the letters and numbers [A-Z][a-z][0-9] to make a set of keys by randomly choosing from that character set 10 times and concatenating the choices together, so A24cv30kf2 could be one key, for example. This implies that the base set has $62^{10}$ possible keys. Next, I want to compute the probability of seeing no collision if I pick keys at random from that set and issue them to $n$ users. If $M$ represents the size of the set ($= 62^{10}$) and $P(M,n)$ this probability, then

$$P(M,n) = \frac{M!}{M^{n}(M-n)!}$$

This follows by replacing 365 with $M$ in the well-known birthday problem shown here.

What I want to understand is the following: I want to check whether my choice of concatenating 10 characters is right, given that my $n$ is of the order of a few million and my $1 - P(M,n)$ needs to be less than $0.0001$ or some similarly small number. The large values of $M$ and $n$ make the above intractable to compute directly. Is there an easier way to compute this?
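For concreteness, the key scheme described above could be sketched in R roughly as follows (a hypothetical illustration only, not my actual generator):

# One key: 10 characters drawn uniformly from [A-Z][a-z][0-9], giving 62^10 possibilities
alphabet <- c(LETTERS, letters, 0:9)
paste(sample(alphabet, 10, replace = TRUE), collapse = "")   # e.g. "A24cv30kf2"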


As in my answer here, the Poisson approximation is pretty accurate. $$\mathbb{P}(\text{no collision})\approx \exp\left(-\frac{n^2}{2M}\right)\approx 1-\frac{n^2}{2M}.$$ Setting this equal to $0.9999$ and $M=62^{10}$, we find that we can issue around $n=12.95$ million keys without worrying.
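For a quick numerical check in R, simply inverting the approximation above (using the $0.0001$ tolerance from the question):

# Largest n with 1 - exp(-n^2/(2M)) <= eps, under the Poisson approximation
M <- 62^10
eps <- 1e-4
sqrt(-2 * M * log(1 - eps))   # about 1.295e7, i.e. roughly 12.95 million keys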


Roughly speaking, you expect a collision when the number of pairs of issued keys is comparable to the number of possible keys. If you issue $n$ keys, you have about $\frac{n^2}{2}$ pairs (ignoring the $-1$ in $\frac{n(n-1)}{2}$), so you can afford to issue about $\sqrt{2M}=\sqrt{2\cdot 62^{10}}=62^5\sqrt{2}\approx 1.3\cdot 10^9$ keys. The Wikipedia article includes a further factor of $\sqrt{\ln 2}\approx 0.83$, which I would ignore; with it, you can issue about one billion codes before you have a $50\%$ chance of a collision.
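The arithmetic, for reference (ordinary double precision is fine for these rough figures):

# Keys you can issue before a ~50% chance of a collision
sqrt(2 * 62^10)            # = 62^5 * sqrt(2), about 1.30e9
sqrt(2 * log(2) * 62^10)   # with the extra sqrt(ln 2) factor, about 1.08e9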


These expressions can be evaluated numerically using the standard "trick" of writing them as the exponential ($\exp$) of their natural logarithm ($\log$), remembering that $\log(ab)=\log(a)+\log(b)$:

$P(M,n)=\exp\left( \log M!-n\log M-\log(M-n)! \right)$

Now $N!=\Gamma(N+1)$, and most software packages have a function that calculates $\log\Gamma(x)$ to reasonable accuracy.

In "R", for instance, your code would look like

P <- function(M, n) exp(lgamma(M + 1) - n * log(M) - lgamma(M - n + 1))   # M!/(M^n (M-n)!) via log-gamma
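As a quick sanity check, a call against the classic birthday problem (an assumed usage example):

P(365, 23)   # about 0.4927, the familiar probability that 23 people share no birthday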

Another possibility is cancelling out the factorials:

$P(M,n)=\frac{M(M-1)\cdots(M-n+1)\,(M-n)!}{M^n\,(M-n)!}=\frac{M}{M}\cdot\frac{M-1}{M}\cdots\frac{M-n+1}{M}$
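A literal transcription of that product in R could look like this (a sketch only; see the precision caveat just below):

# P(M,n) as the product (M/M) * ((M-1)/M) * ... * ((M-n+1)/M)
P_prod <- function(M, n) prod((M - (0:(n - 1))) / M)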

With an $M$ as large as $62^{10}$, both approaches may lose some precision. In fact, in 64-bit double-precision floating point, $62^{10}=62^{10}-1=62^{10}-2=\cdots=62^{10}-60$.

So I believe you need software that can deal with arbitrary-precision rationals to get what you're looking for. You can try the gmp package for "R", or Mathematica, or MATLAB.

EDIT after reading other answers.

Computational comments:

Using arbitrary-precision rationals is infeasible for large $n$, since $n$ is the number of factors in the product above.

Using Stirling lower and upper bounds, as suggested by @GregMartin, only worked (in double precision) up to $M=10^{14}$ and $n=10^6$.

The good news is that a multiple-precision floating-point package enables us to calculate this simply with the code provided above. Some results:
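(Setup, assuming the $P$ function defined above; the package is named at the end of this answer.)

> library(Rmpfr)   # provides mpfr() multiple-precision floating-point numbers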

One million keys:

> M <- mpfr(62, precBits=2048)^10
> n <- mpfr(10, precBits=2048)^6
> P(M,n)
1 'mpfr' number of precision  2048   bits 
[1] 0.99999940426578238924518...

10 million keys:

> n <- mpfr(10, precBits=2048)^7
> P(M,n)
1 'mpfr' number of precision  2048   bits 
[1] 0.99994042828134289665284...

100 million keys: (I'd start to get worried...)

> n <- mpfr(10, precBits=2048)^8
> P(M,n)
1 'mpfr' number of precision  2048   bits 
[1] 0.994060359974671181685...

(I've used package Rmpfr)
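For comparison, the Poisson approximation from the first answer reproduces these figures closely in plain double precision (a quick cross-check, not part of the original computation):

> exp(-(10^6)^2 / (2 * 62^10))   # 0.9999994..., close to the one-million-key result
> exp(-(10^8)^2 / (2 * 62^10))   # 0.99406..., close to the hundred-million-key result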