Why does String.hashCode() in Java have many conflicts? [closed]

Solution 1:

I just hashed 58 thousand English language words (found here), both all-lowercase and also with the first letter capitalized. Know how many collided? Two: "Siblings" and "Teheran" (an alternate spelling of "Tehran").

Just like you, I took a subdomain (in my case a likely one) of possible strings and analyzed the hashCode collision rate for it, and found it to be exemplary. Who is to say that your arbitrary subdomain of possible strings is a better choice to optimize for than mine?

The people who wrote this class had to do so knowing that they couldn't predict (nor, therefore, optimize for) the subdomain in which their users would be using Strings as keys. So they chose a hash function that distributes evenly over the entire domain of strings.
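For context, the function they chose is the one documented in the String Javadoc: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated with ordinary int overflow. A minimal equivalent looks like this (the method name stringHash is mine, not part of the JDK):

// Equivalent to the documented String.hashCode() algorithm:
// s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], with int overflow wrapping.
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);   // one multiply and one add per character
    }
    return h;
}

One multiply and one add per character is about as cheap as a non-trivial string hash gets, which fits the "spread reasonably over all strings, and be fast" goal described above.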

If you're interested, here's my code:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hash each word in both its lowercase and capitalized forms, group by hashCode,
// and keep only the hash values shared by more than one string.
// (Run inside a method that declares throws IOException, since Files.lines is used.)
Map<Integer, List<String>> collisions = Files.lines(Paths.get(System.getProperty("user.home") + "/corncob_lowercase.txt"))
    .flatMap(word -> Stream.of(word, word.substring(0, 1).toUpperCase() + word.substring(1)))
    .collect(Collectors.groupingBy(String::hashCode))
    .entrySet()
    .stream()
    .filter(e -> e.getValue().size() > 1)
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

System.out.printf("Number of collisions: %d%n", collisions.size());
collisions.forEach((hash, words) -> System.out.printf("%d: %s%n", hash, words));

Edit

By the way, if you're curious, the same test with your hash function produced 13 collisions, compared to String.hashCode's 1.

Solution 2:

I'm sorry, but we need to throw some cold water on this idea.

  1. Your analysis is way too simplistic. You seem to have cherry-picked a subset of Strings that is designed to prove your point. This is not evidence that the number of collisions is (statistically) higher than expected across the domain of all strings; see the back-of-the-envelope estimate after this list.

  2. Nobody in their right mind would expect String.hashCode to be highly collision-free[1]. It is simply not designed with that in mind. (If you want highly collision-free hashing, then use a crypto hash algorithm, as sketched at the end of this answer ... and pay the cost.) String.hashCode() is designed to be reasonably good across the domain of all Strings ... and fast.

  3. Assuming that you could make a stronger case, this is not the place to state it. You need to raise this issue with the people who matter - Oracle's Java engineering team.

  4. The current algorithm for String::hashCode has been part of the javadoc specification for String since Java 1.2. (And the algorithm almost certainly goes back to Java 1.0 and earlier.) If the algorithm were changed, it would be a breaking change for some applications. That alone is probably enough to kill the idea.

  5. The Java engineering team would have to weigh the advantages of such a change against the costs of implementing it, both for themselves and for every user of Java.

The costs to users would include dealing with various potential performance and security issues, and with migrating any stored data that depends on hashcodes. Or the cost of having yet more legacy applications that cannot realistically be ported to the latest version of Java.
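
To put point 1 into numbers, here is a rough sketch of my own, assuming an ideal, uniformly distributed 32-bit hash (which String.hashCode only approximates): among n distinct strings, the expected number of colliding pairs is about n(n-1)/2^33.

// Birthday-bound estimate for an ideal uniform 32-bit hash:
// expected colliding pairs among n distinct inputs ~= n*(n-1) / 2^33.
// The figure of 116,000 strings (58k words, two capitalizations) comes from the other answer.
long n = 116_000L;
double expectedCollisions = (double) n * (n - 1) / Math.pow(2, 33);
System.out.printf("Expected collisions for %d strings: %.2f%n", n, expectedCollisions);
// Prints roughly 1.57.

So for a test set of that size, observing one or two collisions is exactly what chance would predict, even from a perfect 32-bit hash.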


1 - "Highly collision free hashing", is an idea / term that I pulled out of the air for the purposes of this answer. Sorry. However, the gist is that the probability of a hashcode collision for 2 strings should be independent of how related they are. So for instance "AA" and "bz" are related by virtue of having the same length. Obviously, this idea needs more thought. And it is also obvious that "relatedness" in the sense I'm talking about is not measurable ... sort of like Kolmogorov Complexity.)