Uniqueness of UUID substring

This is an instance of the Birthday Problem. One formulation of B.P.: Given a choice of n values sampled randomly with replacement, how many values can we sample before the same value will be seen at least twice with probability p?

For the classic instance of the problem,

p = 0.5, n = the 365 days of the year

and the answer is 23. In other words, the odds are 50% that two people share the same birthday when you are surveying 23 people.

You can plug in

n = the number of possible UUIDs

instead to get that kind of cosmically large sample size required for a 50% probability of a collision — something like the billion-per-second figure. It is

n = 16^32

for a 32-character string of 16 case-insensitive hex digits.

B.P. a relatively expensive problem to compute, as there is no known closed-form formula for it. In fact, I just tried it for your 11-character substring (n = 16^11) on Wolfram Alpha Pro, and it timed out.

However, I found an efficient implementation of a closed-form estimate here. And here's my adaptation of the Python.

import math

def find(p, n):
   return math.ceil(math.sqrt(2 * n * math.log(1/(1-p))))

If I plug in the classic B.P. numbers, I get an answer of 23, which is right. For the full UUID numbers,

find(.5, math.pow(16, 32)) / 365 / 24 / 60 / 60 / 100

my result is actually close to 7 billion UUID per second for 100 years! Maybe this estimate is too coarse for large numbers, though I don't know what method your source used.

For the 11-character string? You only have to generate about 5 million IDs total to reach the 50% chance of a collision. For 1%, it's only about 600,000 total. And that's probably overestimating safety, compared to your source (and which we are already guilty of by assuming the substring is random).


My engineering advice: Do you really need the guarantees that UUIDs provide aside from uniqueness, such as non-enumerability, and assurance against collisions in a distributed context? If not, then just use a sequential ID, and avoid these complications.