Hash Map, when to use which method?

I've been learning about HashMaps and their best practices. One of the things that I stumbled upon was collision resolution.

The methods include:

  • Direct Chaining,
  • Linear Probing,
  • Quadratic Probing,
  • Double Hashing.

So far I've found that direct chaining was much easier to implement and made the most sense. I'm not sure which I should focus on in terms of being prepared for technical interviews.


Solution 1:

For technical interviews, I'd suggest getting a high level understanding of the pros/cons of these approaches - specifically:

  • direct chaining degrades slowly as load factor increases, whereas closed hashing/open addressing approaches (all the others you list) grow exponentially worse as load factor approaches 1.0 as it gets harder and harder to find an empty bucket
  • linear probing can be CPU cache friendly with small keys (compared to any of the other techniques): if you can get several keys onto the same cache line, that means the CPU is likely to spend less time groping around in memory after collisions (and SIMD instructions can sometimes help compare against multiple buckets' keys concurrently)
  • linear probing and identity hashing can produce lower-than-cryptographic-hashing collision rates for keys that happened to have a nice distribution across the buckets, such as ids that tend to increment but may have a gap here and there
  • linear probing is much more prone to clusters of collisions than quadratic probing, especially with poor hash functions / difficult-to-hash-well keys and higher load factors (e.g. ballpark >= 0.8), as collisions in one part of the table (even if just by chance more than flawed hashing) tend to exacerbate future use of that part of the table
  • quadratic probing can have a couple bucket offsets that might fall on the same cache line, so you might get a useful probability of the second and even third bucket being on the same cache line as the first, but after a few failures you'll jump off by larger increments, decreasing clustering problems at the expense of more cache misses
  • double hashing is a bit of a compromise; if the second hash happens to produce a 1 then it's equivalent to linear probing, but you might try every 2nd bucket, or every 3rd etc. - up to some limit. There's still plenty of room for clustering (e.g. if h2(k) returned 6 for one key, and 3 for another key that had hashed to a bucket 3 further into the table than the first key, then they'll visit many of the same buckets during their searches.

I wouldn't recommend focusing on any of them in two much depth, or ignoring any, because the contrasts reinforce your understanding of any of the others.