Java HashMap performance optimization / alternative
Solution 1:
As many people pointed out, the hashCode() method was to blame. It was only generating around 20,000 distinct codes for 26 million distinct objects. That is an average of 1,300 objects per hash bucket, which is very, very bad. However, if I turn the two arrays into a number in base 52, I am guaranteed a unique hash code for every object:
public int hashCode() {
    // assume that both a and b are sorted
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}
The arrays are sorted to ensure this method fulfills the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second over blocks of 100,000 puts (from 100,000 to 2,000,000 puts) was:
168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083
Using the new method gives:
337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25
Much, much better. The old method tails off very quickly while the new one keeps up good throughput.
Solution 2:
One thing I notice in your hashCode() method is that the order of the elements in the arrays a[] and b[] doesn't matter. Thus (a[]={1,2,3}, b[]={99,100}) will hash to the same value as (a[]={3,1,2}, b[]={100,99}). In fact, all keys k1 and k2 where sum(k1.a)==sum(k2.a) and sum(k1.b)==sum(k2.b) will result in collisions. I suggest assigning a weight to each position of the array:
hash = hash * 5381 + (c0*a[0] + c1*a[1]);
hash = hash * 5381 + (c0*b[0] + c1*b[1] + c3*b[2]);
where c0, c1 and c3 are distinct constants (you can use different constants for b if necessary). That should even things out a bit more.
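A rough sketch of that, keeping the 503/5381 structure the question's hashCode already uses (the specific values of c0, c1 and c3 below are arbitrary choices of mine, purely for illustration):

public int hashCode() {
    // a (length 2) and b (length 3) are the key's byte arrays, as in the question.
    // The weights just need to be distinct so that, e.g., {0, 1} and {1, 0} no longer collide.
    final int c0 = 23, c1 = 31, c3 = 37;
    int hash = 503;
    hash = hash * 5381 + (c0 * a[0] + c1 * a[1]);
    hash = hash * 5381 + (c0 * b[0] + c1 * b[1] + c3 * b[2]);
    return hash;
}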
Solution 3:
To elaborate on Pascal: Do you understand how a HashMap works? You have some number of slots in your hash table. The hash value for each key is computed and then mapped to a slot in the table. If two hash values map to the same slot -- a "hash collision" -- HashMap builds a linked list.
Hash collisions can kill the performance of a hash map. In the extreme case, if all your keys have the same hash code, or if they have different hash codes but they all map to the same slot, then your hash map turns into a linked list.
So if you're seeing performance problems, the first thing I'd check is: Am I getting a random-looking distribution of hash codes? If not, you need a better hash function. Well, "better" in this case may mean "better for my particular set of data". Like, suppose you were working with strings, and you took the length of the string for the hash value. (Not how Java's String.hashCode works, but I'm just making up a simple example.) If your strings have widely varying lengths, from 1 to 10,000, and are fairly evenly distributed across that range, then this could be a very good hash function. But if your strings are all 1 or 2 characters, it would be a very bad hash function.
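A quick way to eyeball that distribution (just a rough diagnostic sketch; reportHashSpread and its parameter are names I've made up) is to count how many distinct hash codes your keys actually produce:

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Count distinct hash codes over whatever key objects you are inserting.
static void reportHashSpread(Collection<?> keys) {
    Set<Integer> distinct = new HashSet<>();
    for (Object key : keys) {
        distinct.add(key.hashCode());
    }
    System.out.println(keys.size() + " keys -> " + distinct.size() + " distinct hash codes");
}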
Edit: I should add: Every time you add a new entry, HashMap checks whether it is a duplicate. When there's a hash collision, it has to compare the incoming key against every key that mapped to that slot. So in the worst case, where everything hashes to a single slot, the second key is compared to the first key, the third key is compared to #1 and #2, the fourth key is compared to #1, #2, and #3, and so on. By the time you get to key #1 million, you've done roughly half a trillion compares.
@Oscar: Umm, I don't see how that's a "not really". It's more like a "let me clarify". But yes, it's true that if you make a new entry with the same key as an existing entry, this overwrites the first entry. That's what I meant when I talked about looking for duplicates in the last paragraph: whenever a key hashes to the same slot, HashMap must check whether it's a duplicate of an existing key or whether the two just landed in the same slot by coincidence of the hash function. I don't know that that's the "whole point" of a HashMap; I would say the "whole point" is that you can retrieve elements by key quickly.
But anyway, that doesn't affect the "whole point" I was trying to make: when you have two keys -- yes, different keys, not the same key showing up again -- that map to the same slot in the table, HashMap builds a linked list. Then, because it has to check each new key to see if it is in fact a duplicate of an existing key, each attempt to add a new entry that maps to this same slot must chase the linked list, examining each existing entry to see whether it is a duplicate of a previously seen key or a genuinely new key.
Update long after the original post
I just got an upvote on this answer 6 years after posting it, which led me to re-read the question.
The hash function given in the question is not a good hash for 26 million entries.
It adds together a[0]+a[1] and b[0]+b[1]+b[2]. He says values of each byte range from 0 to 51, so that gives only (51*2+1)*(51*3+1)=15,862 possible hash values. With 26 million entries, this means an average of about 1639 entries per hash value. That is lots and lots of collisions, requiring lots and lots of sequential searches through linked lists.
The OP says that different orders within array a and array b should be considered equal, i.e. [[1,2],[3,4,5]].equals([[2,1],[5,3,4]]), and so to fulfill the contract they must have equal hash codes. Okay. Still, there are a lot more than 15,000 possible values. His second proposed hash function is much better, giving a broader range.
Though, as someone else commented, it seems inappropriate for a hash function to modify the data it is hashing. It would make more sense to "normalize" the object when it is created, or to have the hash function work from copies of the arrays. Also, using a loop to calculate the constants every time through the function is inefficient. As there are only four values here, I would have either written
return a[0] + a[1]*52 + b[0]*52*52 + b[1]*52*52*52 + b[2]*52*52*52*52;
which lets the compiler perform the constant multiplications once at compile time, or defined 4 static constants in the class.
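Putting those two suggestions together, a sketch might look like the following (the class and field names are placeholders I've chosen; it assumes the question's layout of a two-byte array a and a three-byte array b with values 0 to 51):

import java.util.Arrays;

class Key {
    // Powers of 52 precomputed as compile-time constants, so hashCode() does no loops.
    private static final int P1 = 52;
    private static final int P2 = 52 * 52;
    private static final int P3 = 52 * 52 * 52;
    private static final int P4 = 52 * 52 * 52 * 52;

    private final byte[] a; // length 2, values 0..51
    private final byte[] b; // length 3, values 0..51

    Key(byte[] a, byte[] b) {
        // "Normalize" once at construction instead of sorting inside hashCode().
        this.a = a.clone();
        this.b = b.clone();
        Arrays.sort(this.a);
        Arrays.sort(this.b);
    }

    @Override
    public int hashCode() {
        return a[0] + a[1] * P1 + b[0] * P2 + b[1] * P3 + b[2] * P4;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key other = (Key) o;
        // Arrays are stored sorted, so different orderings of the same values compare equal.
        return Arrays.equals(a, other.a) && Arrays.equals(b, other.b);
    }
}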
Also, the first draft of the hash function has several calculations that do nothing to add to the range of outputs. Note that he first sets hash = 503, then multiplies by 5381 before even considering values from the class. So, in effect, he adds 503*5381 to every value. What does this accomplish? Adding a constant to every hash value just burns CPU cycles without accomplishing anything useful. Lesson here: adding complexity to a hash function is not the goal. The goal is to get a broad range of different values, not to add complexity for the sake of complexity.
Solution 4:
I'd suggest a three-pronged approach:
1. Run Java with more memory: java -Xmx256M, for example, runs it with 256 megabytes of heap. Use more if needed and you have lots of RAM.
2. Cache your calculated hash values, as suggested by another poster, so each object only calculates its hash value once (a sketch follows at the end of this answer).
3. Use a better hashing algorithm. The one you posted would return the same hash where a = {0, 1} as it would where a = {1, 0}, all else being equal.
Utilise what Java gives you for free.
public int hashCode() {
    return 31 * Arrays.hashCode(a) + Arrays.hashCode(b);
}
I'm pretty sure this has much less chance of clashing than your existing hashCode method, although it depends on the exact nature of your data.
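For point 2, a rough sketch of caching the hash (my own illustration; it assumes the key stores the two byte arrays in some canonical order and never mutates them afterwards, and it reuses the Arrays.hashCode-based hash above):

import java.util.Arrays;

class CachedKey {
    private final byte[] a;
    private final byte[] b;
    private final int cachedHash; // computed once at construction, reused on every lookup

    CachedKey(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
        this.cachedHash = 31 * Arrays.hashCode(this.a) + Arrays.hashCode(this.b);
    }

    @Override
    public int hashCode() {
        return cachedHash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedKey)) return false;
        CachedKey other = (CachedKey) o;
        return Arrays.equals(a, other.a) && Arrays.equals(b, other.b);
    }
}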
Solution 5:
My first idea is to make sure you're initializing your HashMap appropriately. From the JavaDocs for HashMap:
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
So if you're starting off with a too-small HashMap, then every time it needs to resize it has to rehash all the existing entries into the larger table... which might be what you're feeling when you get to the 2-3 million insertion point.
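For illustration, a minimal pre-sizing sketch (presizedMap is a made-up helper name; the 26 million figure comes from the question, and 0.75 is HashMap's default load factor):

import java.util.HashMap;
import java.util.Map;

// Create the map large enough up front that it never resizes (and rehashes)
// while the expected number of entries is loaded.
static <K, V> Map<K, V> presizedMap(int expectedEntries) {
    float loadFactor = 0.75f; // HashMap's default
    int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
    return new HashMap<>(initialCapacity, loadFactor);
}

// e.g. for the question's 26 million entries:
// Map<SomeKey, SomeValue> map = presizedMap(26_000_000);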