Why is the size 127 (prime) better than 128 for a hash-table?

Solution 1:

"All numbers (when hashed) are still going to be the p lowest-order bits of k for 127 too."

That is wrong (or I have misunderstood). k % 127 depends on all bits of k; k % 128 depends only on the 7 lowest bits.
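
A quick way to see the difference (a small Python sketch; the specific keys are just for illustration):

    a = 0b000001011        # 11
    b = 0b101000001011     # 2571: the same low bits as a, plus some high bits

    print(a % 128, b % 128)   # 11 11 -> the high bits are thrown away, the keys collide
    print(a % 127, b % 127)   # 11 31 -> the high bits change the result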


EDIT:

If you have a perfect distribution of keys between 1 and 10,000, then reducing them with % 127 or % 128 will both turn it into an excellent smaller distribution: every bucket will contain about 10,000 / 128 = 78 (or 79) items.

If you have a distribution between 1 and 10,000 that is biased because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer. (Unless x is a multiple of that prime.)

Thus, cutting off the high bits (using a size of 128) is no problem whatsoever if the distribution of the lower bits is good enough. But with real data and real, badly designed hash functions, you will need those high bits.
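
To make that concrete, here is a small Python sketch with a deliberately biased input (every key a multiple of 8; the numbers are only illustrative):

    from collections import Counter

    keys = [8 * i for i in range(1, 1001)]   # biased input: every key is a multiple of 8

    for m in (128, 127):
        counts = Counter(k % m for k in keys)
        print(f"m={m}: {len(counts)} buckets used, largest bucket has {max(counts.values())} keys")

    # m=128: only 16 of the 128 buckets are ever used, each holding ~63 keys
    # m=127: all 127 buckets are used, each holding 7 or 8 keys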

Solution 2:

Division Method

"When using the division method, we usually avoid certain values of m (table size). For example, m should not be a power of 2, since if m = 2p , then h(k) is just the p lowest-order bits of k."

--CLRS

To understand why m = 2^p uses only the p lowest-order bits of k, you must first understand the modulo hash function h(k) = k % m.

The key can be written in terms of the divisor m, a quotient n, and a remainder r:

k = nm + r,  where 0 <= r < m

The hash h(k) = k % m is simply that remainder:

k % m = r = k - nm

Therefore, k % m is equivalent to repeatedly subtracting m, a total of n times, until the result is less than m:

k % m = k - m - m - ... - m,  until the result is < m

Let's try hashing the key k = 91 with m = 2^4 = 16.

  91 = 0101 1011
- 16 = 0001 0000
----------------
  75 = 0100 1011
- 16 = 0001 0000
----------------
  59 = 0011 1011
- 16 = 0001 0000
----------------
  43 = 0010 1011
- 16 = 0001 0000
----------------
  27 = 0001 1011
- 16 = 0001 0000
----------------
  11 = 0000 1011

Thus, 91 % 2^4 = 11 is just the binary form of 91 with only the p = 4 lowest bits remaining.
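
Equivalently, reducing modulo a power of two is just a bit mask, which is why only the low bits survive; a quick Python check:

    k, p = 91, 4
    m = 1 << p             # 16
    print(k % m)           # 11
    print(k & (m - 1))     # 11 -- the mask 0b1111 keeps only the p = 4 lowest bits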


Important Distinction:

This pertains specifically to the division method of hashing. For the multiplication method, the opposite is true, as stated in CLRS:

"An advantage of the multiplication method is that the value of m is not critical... We typically choose [m] to be a power of 2 since we can then easily implement the function on most computers."

Solution 3:

Nick is right that, in general, the hash table size doesn't matter. However, in the special case of open addressing with double hashing (where the interval between probes is computed by a second hash function), a prime-sized hash table is best: the probe step is then always relatively prime to the table size, so the probe sequence can reach every slot and all hash table entries are available for a new element (as Corkscreewe mentioned).
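
A small sketch of that point (the hash functions below are just placeholders; what matters is that with a prime m, any nonzero probe step is relatively prime to m, so the probe sequence visits every slot):

    def probe_sequence(k, m):
        h1 = k % m                   # initial slot
        h2 = 1 + (k % (m - 1))       # probe step, never zero
        return [(h1 + i * h2) % m for i in range(m)]

    print(len(set(probe_sequence(12345, 127))))   # 127: prime-sized table, every slot is reachable
    print(len(set(probe_sequence(12346, 128))))   # 32: the step (28) shares a factor with 128, so most slots are never probed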

Solution 4:

First off, it's not about picking a prime number. For your example, if you know your data set will be in the range 1 to 10,000, picking 127 or 128 won't make a difference because either is a poor design choice.

Rather, it's better to pick a really large prime, like 3967 for your example, so that each item is more likely to get its own bucket; you also want to minimize collisions. Picking 127 or 128 for your example won't make a difference because all 127/128 buckets will be uniformly filled (this is bad and will degrade insertion and lookup from O(1) toward O(n)), as opposed to 3967, which will better preserve the O(1) run times.

EDIT #4

The design of the "hash function" is somewhat of a black art. It can be highly influenced by the data that's intended to be stored in the hashing-based data structure, so the discussion on a sensible hashing function can often stray into a discussion about specific inputs.

As for why primes are "preferred", one has to consider an "adversary" analysis: suppose I design a general hashing-based data structure, how would it perform given the worst input an adversary could feed it? Since performance is dictated by hashing collisions, the question becomes which table size minimizes collisions in that worst case. One such case is when the inputs are all divisible by some integer, say 4. If you use N = 128, then any number divisible by 4 is still divisible by 4 after taking it mod 128, which means only buckets 0, 4, 8, 12, ... are ever used, resulting in 25% utilization of the data structure. A prime N effectively removes this scenario: any divisor that is not a multiple of the prime still spreads the keys across all buckets.
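
A quick check of that utilization figure (Python sketch; the adversarial input is simply every multiple of 4):

    keys = range(0, 100000, 4)                # adversarial input: everything divisible by 4

    used_128 = len({k % 128 for k in keys})   # 32 of 128 buckets -> 25% utilization
    used_127 = len({k % 127 for k in keys})   # all 127 buckets   -> 100% utilization
    print(used_128, used_127)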

Solution 5:

If you have a perfect hash function that has an even distribution, then it doesn't matter.