Curious Properties of 33
Because my explanation has so many words, I'll start with my question and then you can read the explanation if you need to:
The Bernstein Hash method uses the number 33 as a multiplier. From what I've read Bernstein himself has no reasonable explanation as to why 33 has such useful properties. I'm wondering if the mathematics community has any theories on the matter.
The explanation:
I'm a software engineer and I was recently working on a blog post about a hash function I was writing. In the process of designing my hash function, I looked at a lot of implementations of other hash functions.
For non-programmers, the gist is that the hashcodes produced by objects should be well distributed throughout the range of values in a 32 bit integer (-2,147,483,648 through 2,147,483,647).
Let's say I have the string "ABCD." A hash function would loop through each character, get the ASCII value of it, do something to it and aggregate it into a composite hashcode.
For example, in a lot of implementations, they'd take a hashcode initialized to a large prime, multiply it by another prime and the XOR it with the value of 'A' which is 65. Then, they'd take that and multiply it by the same prime and XOR that with the value of 'B'. They'd do this until the end of the string is reached.
I found a clever implementation in the Java framework code that loops over each item and effectively applies this: $i = ((i \ll 5) - i) \oplus j$. I was confused at first until I worked out that $(i \ll 5) - i = 31i$. For a computer, bit shifting is faster than multiplying so this is a clever way to multiply by a prime number.
So, I looked in Microsoft's .Net framework and I found that they do it a little differently. They use $(i \ll 5) + i$ instead! I couldn't figure out for the life of me why they used 33 instead of 31 because from what I understand multiplying by prime numbers is the foundation of many hashing and cryptographic functions.
I found out that this technique is called the Bernstein Hash and that Bernstein himself doesn't know why 33 produces such a good distribution of values as a multiplier in hashing functions.
Solution 1:
I've always assumed that the factor 33 (which amounts to a 5-bit shift plus a addition, as you note) was chosen because 5 bits is roughly the significant content of (ASCII) text, hence the "spreading" is rather optimal for typical textual keys. I doubt something more can be said from the strictly mathematical point of view.
"from what I understand multiplying by prime numbers is the foundation of many hashing and cryptographic functions." That sounds rather vague to me: for example, for linear congruential generators, the multiplicator is frequently not prime. Empirically, it's seen that 31 or 33 make very little difference. Bear in mind that hashing functions usually relax the strict requirements of cryptographic functions, in favour of simplicity and performance.
Some reasonable heuristics and experimental results are given here.
See also this old discussion about the 33 vs 31 thing.
Solution 2:
I'm not really familiar with this particular hash function, but I decided to try going for the source, which in this case seems to be the Usenet newsgroup comp.lang.c. In this message, Phong Vo[drew] points out that one advantage of 33 over 31 is that it is congruent to 1 modulo 4, so that $k \mapsto 33*k + c \pmod{2^n}$, for a given $c \ne 0$, is a cyclic permutation on $\mathbb Z / 2^n \mathbb Z$. In particular, this means that, if you feed the hash function constant (non-zero) input, it behaves like a full-period LCRNG.
Of course, this doesn't really prove anything; it's just suggestive. Also note that this explanation doesn't directly apply to the version using XOR instead of addition to mix in the data. (Both seem to have been recommended by Bernstein at various times.) In practice, of course, I wouldn't expect much difference between the XOR and addition variants anyway; they are very similar operations, and either should mix in the input just about as well.
Like leonbloy, I also suspect that part of the reason why 33 (or 31) works so well in practice may be that the typical input to such functions is often ASCII text, which tends to have most of its entropy in the lowest five bits. Thus, shifting the hash value left by five bits per round should diffuse this entropy particularly efficiently across all its bits.
Also, it's worth noting that this function was never intended to be a particularly good hash; it was designed to be a fast, yet still acceptably good, hash for applications where the speed of hashing strings is a limiting factor.
Solution 3:
Ilmari and leonbloy have already said quite a bit and provided pertinent links; here are some more thoughts:
First, about the primes: the accepted answer to this question at stackoverflow sounds right to me. It's not that either the multiplier or the hash table size need to be primes; they should just be coprime. If you're writing the code for both, there's no reason for them to be prime, but if you're writing the code for only one of them (which often happens in practice), you might want to choose primes, or at least numbers with large prime factors, to reduce the chances of a "prime collision" between the multiplier and the hash table size.
Second, about hashcodes being "well distributed" or "optimally spread": That's only part of the story. For long strings, where injectivity is not an option, random spread is good, but for short strings, injectivity is even better than random spread. Ideally, a multiplier-based hash function should have a random spread for initial inputs but be injective with respect to the last few inputs.
ASCII codes for letters of the same case differ only in the low $5$ bits. So for strings ending in lowercase letters and $32$-bit hash values, you can get injectivity with respect to the last six characters if the multiplier is greater than $32$, whereas for multipliers less than $32$ the lowest bit of the penultimate input influences at most the last $5$ bits and thus its contribution isn't linearly independent from the contributions of all the other bits. In that case you're only using half the available hash values, and the values for the last six inputs form pairs that are mapped to the same hash values.
If the multiplier is greater than $32$ and the multiplication doesn't overflow, injectivity with respect to the last $6$ lowercase characters is guaranteed. This is the case up to a multiplier of $47$ (since $2^4\cdot47^5\lesssim2^{32}$), but even beyond that, the contributions from different bits are usually linearly independent (there's a good chance for $30$ random vectors in $\mathbb F_2^{32}$ to be linearly independent); the first odd multiplier for which they aren't is $95$.
So it's not immediately clear why $33$ should be better for ASCII strings than any other odd multiplier greater than $32$ with remainder $1$ mod $4$ (see Ilmari's answer). However, more generally speaking, the lower the multiplier, the more final inputs get mapped injectively before they start getting scrambled by the overflow. So it could well be that the $33$ is a good trade-off between getting the lower $5$ bits of ASCII characters out of each other's way (which requires $f\ge32$) and delaying the onset of scrambling by overflow (which requires low $f$).
A note on the difference between the XOR variant and the one where $j$ is added instead: My considerations above about linear independence apply only to the XOR case, in which we can treat the hash value as a vector in $\mathbb F_2^{32}$ and the hash function as an affine transform on that space. In the case of addition, we don't need the multiplier to be greater than $32$ for the hash to be injective with respect to the last six inputs, just greater than the number of lowercase letters, which is $26$, so in that case there's no particular advantage to $33$. That's borne out by the experimental results that leonbloy linked to, where $31$ (labeled "K&R") even does slightly better on words than $33$ (labeled "Bernstein"); you can see near the top of the page or by following the Bernstein link that these results refer to the case with addition. (This is consistent with the idea that lower multipliers are generally better as long as they map the lowercase letters injectively, but it seems to cast some doubt on Ilmari's point about the remainder mod $4$?)
Examples where the combination of $31$ and XOR fails to be injective already occur with $3$ letters; for instance, "rox" and "rng" are both mapped to 1A607
.
This is all just about the advantages of a multiplier with respect to avoiding hash collisions. Often the speed of computing the hash function is at least as important as avoiding collisions, and in that respect $33$ has the advantage over most other multipliers (though not over $31$) that it can easily be computed with a shift and an addition, as you explained.