Understanding strange Java hash function
Here is some code and the sample output:
public static void main ( String[] args ) {
int h = 0xffffffff;
int h1 = h >>> 20;
int h2 = h >>> 12;
int h3 = h1 ^ h2;
int h4 = h ^ h3;
int h5 = h4 >>> 7;
int h6 = h4 >>> 4;
int h7 = h5 ^ h6;
int h8 = h4 ^ h7;
printBin ( h );
printBin ( h1 );
printBin ( h2 );
printBin ( h3 );
printBin ( h4 );
printBin ( h5 );
printBin ( h6 );
printBin ( h7 );
printBin ( h8 );
}
static void printBin ( int h ) {
System.out.println ( String.format ( "%32s",
Integer.toBinaryString ( h ) ).replace ( ' ', '0' ) );
}
Which prints:
11111111111111111111111111111111
00000000000000000000111111111111
00000000000011111111111111111111
00000000000011111111000000000000
11111111111100000000111111111111
00000001111111111110000000011111
00001111111111110000000011111111
00001110000000001110000011100000
11110001111100001110111100011111
So, the code breaks down the hash function into steps so that you can see what is happening. The first shift of 20 positions xor with the second shift of 12 positions creates a mask that can flip 0 or more of the bottom 20 bits of the int. So you can get some randomness inserted into the bottom bits that makes use of the potentially better distributed higher bits. This is then applied via xor to the original value to add that randomness to the lower bits. The second shift of 7 positions xor the shift of 4 positions creates a mask that can flip 0 or more of the bottom 28 bits, which brings some randomness again to the lower bits and to some of the more significant ones by capitalizing on the prior xor which already addressed some of the distribution at the lower bits. The end result is a smoother distribution of bits through the hash value.
Since the hashmap in java computes the bucket index by combining the hash with the number of buckets you need to have an even distribution of the lower bits of the hash value to spread the entries evenly into each bucket.
As to proving the statement that this bounds the number of collisions, that one I don't have any input on. Also, see here for some good information on building hash functions and a few details on why the xor of two numbers tends towards random distribution of bits in the result.
>>>
is a bitshift with zero fill.
^
is an XOR.
XOR
is also called exclusive or--it is a math operator that combines two numbers. See http://en.wikipedia.org/wiki/Exclusive_or
A right bitshift by n
is like dropping the n
lowest bits off of the number. So if the number is 00010111
, and you shifted it right by 1, you'd get 00001011
.
Here's an article that discusses integer hash functions and some of the considerations to which they are designed. It's not very detailed, but the main point is this:
the operations must use a chain of computations to achieve avalanche. Avalanche means that a single bit of difference in the input will make about 1/2 of the output bits be different.
Basically, the goal is for the supplemental hash function to remove any regularities in the input, because those could cause the hash table to degenerate.