Bad idea to use String key in HashMap?
Solution 1:
Developers do not have to work around the issue of hash collisions in HashMap in order to achieve program correctness.
There are a couple of key things to understand here:
- Collisions are an inherent feature of hashing, and they have to be. The number of possible values (Strings in your case, but it applies to other types as well) is vastly bigger than the range of integers.
- Every usage of hashing has a way to handle collisions, and the Java Collections (including HashMap) is no exception.
- Hashing is not involved in equality testing. It is true that equal objects must have equal hashcodes, but the reverse is not true: many values will have the same hashcode. So don't try using a hashcode comparison as a substitute for equality. Collections don't. They use hashing to select a sub-collection (called a bucket in the Java Collections world), but they use .equals() to actually check for equality.
- Not only do you not have to worry about collisions causing incorrect results in a collection, but for most applications, you also *usually* don't have to worry about performance - Java hashed Collections do a pretty good job of managing hashcodes.
- Better yet, for the case you asked about (Strings as keys), you don't even have to worry about the hashcodes themselves, because Java's String class generates a pretty good hashcode. So do most of the supplied Java classes.
Some more detail, if you want it:
The way hashing works (in particular, in the case of hashed collections like Java's HashMap, which is what you asked about) is this:
The HashMap stores the values you give it in a collection of sub-collections, called buckets. These are actually implemented as linked lists. There are a limited number of these: iirc, 16 to start by default, and the number increases as you put more items into the map. There should always be more buckets than values. To provide one example, using the defaults, if you add 100 entries to a HashMap, there will be 256 buckets.
Every value which can be used as a key in a map must be able to generate an integer value, called the hashcode.
The HashMap uses this hashcode to select a bucket. Ultimately, this means taking the integer value
modulo
the number of buckets, but before that, Java's HashMap has an internal method (calledhash()
), which tweaks the hashcode to reduce some known sources of clumping.When looking up a value, the HashMap selects the bucket, and then searches for the individual element by a linear search of the linked list, using
.equals()
.
So: you don't have to work around collisions for correctness, and you usually don't have to worry about them for performance, and if you're using native Java classes (like String), you don't have to worry about generating the hashcode values either.
In the case where you do have to write your own hashcode method (which means you've written a class with a compound value, like a first name/last name pair), things get slightly more complicated. It's quite possible to get it wrong here, but it's not rocket science. First, know this: the only thing you must do in order to assure correctness is to assure that equal objects yield equal hashcodes. So if you write a hashcode() method for your class, you must also write an equals() method, and you must examine the same values in each.
It is possible to write a hashcode() method which is bad but correct, by which I mean that it would satisfy the "equal objects must yield equal hashcodes" constraint, but still perform very poorly, by having a lot of collisions.
The canonical degenerate worst case of this would be to write a method which simply returns a constant value (e.g., 3) for all cases. This would mean that every value would be hashed into the same bucket.
It would still work, but performance would degrade to that of a linked list.
Obviously, you won't write such a terrible hashcode() method. If you're using a decent IDE, it's capable of generating one for you. Since StackOverflow loves code, here's the code for the firstname/lastname class above.
public class SimpleName {
private String firstName;
private String lastName;
public SimpleName(String firstName, String lastName) {
super();
this.firstName = firstName;
this.lastName = lastName;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result
+ ((firstName == null) ? 0 : firstName.hashCode());
result = prime * result
+ ((lastName == null) ? 0 : lastName.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
SimpleName other = (SimpleName) obj;
if (firstName == null) {
if (other.firstName != null)
return false;
} else if (!firstName.equals(other.firstName))
return false;
if (lastName == null) {
if (other.lastName != null)
return false;
} else if (!lastName.equals(other.lastName))
return false;
return true;
}
}
Solution 2:
I direct you to the answer here. While it is not a bad idea to use strings( @CPerkins explained why, perfectly), storing the values in a hashmap with integer keys is better, since it is generally quicker (although unnoticeably) and has lower chance (actually, no chance) of collisions.
See this chart of collisions using 216553 keys in each case, (stolen from this post, reformatted for our discussion)
Hash Lowercase Random UUID Numbers ============= ============= =========== ============== Murmur 145 ns 259 ns 92 ns 6 collis 5 collis 0 collis FNV-1a 152 ns 504 ns 86 ns 4 collis 4 collis 0 collis FNV-1 184 ns 730 ns 92 ns 1 collis 5 collis 0 collis* DBJ2a 158 ns 443 ns 91 ns 5 collis 6 collis 0 collis*** DJB2 156 ns 437 ns 93 ns 7 collis 6 collis 0 collis*** SDBM 148 ns 484 ns 90 ns 4 collis 6 collis 0 collis** CRC32 250 ns 946 ns 130 ns 2 collis 0 collis 0 collis Avg Time per key 0.8ps 2.5ps 0.44ps Collisions (%) 0.002% 0.002% 0%
Of course, the number of integers is limited to 2^32, where as there is no limit to the number of strings (and there is no theoretical limit to the amount of keys that can be stored in a HashMap
). If you use a long
(or even a float
), collisions will be inevitable, and therefore no "better" than a string. However, even despite hash collisions, put()
and get()
will always put/get the correct key-value pair (See edit below).
In the end, it really doesn't matter, so use whatever is more convenient. But if convenience makes no difference, and you do not intend to have more than 2^32 entries, I suggest you use ints
as keys.
EDIT
While the above is definitely true, NEVER use "StringKey".hashCode() to generate a key in place of the original String
key for performance reasons- 2 different strings can have the same hashCode, causing overwriting on your put()
method. Java's implementation of HashMap
is smart enough to handle strings (any type of key, actually) with the same hashcode automatically, so it is wise to let Java handle these things for you.
Solution 3:
I strongly suspect that the HashMap.put
method does not determine whether the key is the same by just looking at String.hashCode
.
There is definitely going to be a chance of a hash collision, so one would expect that the String.equals
method will also be called to be sure that the String
s are truly equal, if there is indeed a case where the two String
s have the same value returned from hashCode
.
Therefore, the new key String
would only be judged to be the same key String
as one that is already in the HashMap
if and only if the value returned by hashCode
is equal, and the equals
method returns true
.
Also to add, this thought would also be true for classes other than String
, as the Object
class itself already has the hashCode
and equals
methods.
Edit
So, to answer the question, no, it would not be a bad idea to use a String
for a key to a HashMap
.