What is the significance of load factor in HashMap?

The documentation explains it pretty well:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

As with all performance optimizations, it is a good idea to avoid optimizing things prematurely (i.e. without hard data on where the bottlenecks are).


Default initial capacity of the HashMap takes is 16 and load factor is 0.75f (i.e 75% of current map size). The load factor represents at what level the HashMap capacity should be doubled.

For example product of capacity and load factor as 16 * 0.75 = 12. This represents that after storing the 12th key – value pair into the HashMap , its capacity becomes 32.


Actually, from my calculations, the "perfect" load factor is closer to log 2 (~ 0.7). Although any load factor less than this will yield better performance. I think that .75 was probably pulled out of a hat.

Proof:

Chaining can be avoided and branch prediction exploited by predicting if a bucket is empty or not. A bucket is probably empty if the probability of it being empty exceeds .5.

Let s represent the size and n the number of keys added. Using the binomial theorem, the probability of a bucket being empty is:

P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0)

Thus, a bucket is probably empty if there are less than

log(2)/log(s/(s - 1)) keys

As s reaches infinity and if the number of keys added is such that P(0) = .5, then n/s approaches log(2) rapidly:

lim (log(2)/log(s/(s - 1)))/s as s -> infinity = log(2) ~ 0.693...

What is load factor ?

The amount of capacity which is to be exhausted for the HashMap to increase its capacity ?

Why load factor ?

Load factor is by default 0.75 of the initial capacity (16) therefore 25% of the buckets will be free before there is an increase in the capacity & this makes many new buckets with new hashcodes pointing to them to exist just after the increase in the number of buckets.

Now why should you keep many free buckets & what is the impact of keeping free buckets on the performance ?

If you set the loading factor to say 1.0 then something very interesting might happen.

Say you are adding an object x to your hashmap whose hashCode is 888 & in your hashmap the bucket representing the hashcode is free , so the object x gets added to the bucket, but now again say if you are adding another object y whose hashCode is also 888 then your object y will get added for sure BUT at the end of the bucket (because the buckets are nothing but linkedList implementation storing key,value & next) now this has a performance impact ! Since your object y is no longer present in the head of the bucket if you perform a lookup the time taken is not going to be O(1) this time it depends on how many items are there in the same bucket. This is called hash collision by the way & this even happens when your loading factor is less than 1.

Correlation between performance , hash collision & loading factor ?

Lower load factor = more free buckets = less chances of collision = high performance = high space requirement.

Correct me if i am wrong somewhere.