Collision probability of ObjectId vs UUID in a large distributed system

Considering that an UUID rfc 4122 (16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probability compare.

I know that is something around quite unlikely, but in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

Compared to the normal case where all ids are generated by a small number of clients:

  • It might take months to detect a collision since the document creation
  • IDs are generated from a much larger client base
  • Each client has a lower ID generation rate

in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?

Anyway, some deliberations about the collision probability:

Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:

  • 4 byte seconds since unix epoch
  • 3 byte machine id
  • 2 byte process id
  • 3 byte counter

This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.

As you can see from the structure of the other 3 components, collisions become very likely if you're doing > 1k inserts/s on a single process (not really possible, not even from a server), or if the number of machines grows past about 10 (see birthday problem), or if the number of processes on a single machine grows too large (then again, those aren't random numbers, but they are truly unique on a machine, but they must be shortened to two bytes).

Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.


Let's look at the spec for "ObjectId" from the documentation:

Overview

ObjectId is a 12-byte BSON type, constructed using:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.

So let us consider this in the context of a "mobile client".

Note: The context here does not mean using a "direct" connection of the "mobile client" to the database. That should not be done. But the "_id" generation can be done quite simply.

So the points:

  1. Value for the "seconds since epoch". That is going to be fairly random per request. So minimal collision impact just on that component. Albeit in "seconds".

  2. The "machine identifier". So this is a different client generating the _id value. This is removing possibility of further "collision".

  3. The "process id". So where that is accessible to seed ( and it should be ) then the generated _id has more chance of avoiding collision.

  4. The "random value". So another "client" somehow managed to generate all of the same values as above and still managed to generate the same random value.

Bottom line is, if that is not a convincing enough argument to digest, then simply provide your own "uuid" entries as the "primary key" values.

But IMHO, that should be a fair convincing argument to consider that the collision aspects here are very broad. To say the least.

The full topic is probably just a little "too-broad". But I hope this moves consideration a bit more away from "Quite unlikely" and on to something a little more concrete.