Why doesn't String's hashCode() cache 0?

I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The difference in performance is exhibited by the following snippet:

public class Main{
   static void test(String s) {
      long start = System.currentTimeMillis();
      for (int i = 0; i < 10000000; i++) {
         s.hashCode();
      }
      System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);
   }
   public static void main(String[] args) {
      String z = "Allocator redistricts; strict allocator redistricts strictly.";
      test(z);
      test(z.toUpperCase());
   }
}

Running this in ideone.com gives the following output:

Took 1470 ms.
Took 58 ms.

So my questions are:

  • Why doesn't String's hashCode() cache 0?
  • What is the probability that a Java string hashes to 0?
  • What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
  • Is this the best-practice way of caching values? (i.e. cache all except one?)

For your amusement, each line here is a string that hash to 0:

pollinating sandboxes
amusement & hemophilias
schoolworks = perversive
electrolysissweeteners.net
constitutionalunstableness.net
grinnerslaphappier.org
BLEACHINGFEMININELY.NET
WWW.BUMRACEGOERS.ORG
WWW.RACCOONPRUDENTIALS.NET
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never yodelled an apology, never preened vocalizing transsexuals.

Solution 1:

You're worrying about nothing. Here's a way to think about this issue.

Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.

And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).

In this situation, you have more to benefit from Oracle's engineers improving how these strings' hash codes are cached than anyone else alive. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.

Hooray! Now your app is... let's see... 0.0015% faster!

What used to take an entire day now takes only 23 hours, 57 minutes and 48 seconds!

And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.

Does this seem worth it to you?

EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!

Solution 2:

It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or to not cache the hashcode at all, of course.)

I don't expect many strings hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions but avoid rehashing. It's too late to do that now though, as the String hashCode algorithm is explicitly documented.

As for whether this is a good idea in general: it's an certainly efficient caching mechanism, and might (see edit) be even better with a change to avoid rehashing values which end up with a hash of 0. Personally I would be interested to see the data which led Sun to believe this was worth doing in the first place - it's taking up an extra 4 bytes for every string ever created, however often or rarely it's hashed, and the only benefit is for strings which are hashed more than once.

EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost because it helps a very rare case, but requires an extra comparison for every hash calculation.

Solution 3:

I think there's something important that the other answers so far are missing: the zero value exists so that the hashCode-caching mechanism works robustly in a multi-threaded environment.

If you had two variables, like cachedHashCode itself and an isHashCodeCalculated boolean to indicate whether cachedHashCode had been calculated, you'd need thread synchronization for things to work in a multithreaded environment. And synchronization would be bad for performance, especially since Strings are very commonly reused in multiple threads.

My understanding of the Java memory model is a little sketchy, but here's roughly what's going on:

  1. When multiple threads access a variable (like the cached hashCode), there's no guarantee that each thread will see the latest value. If a variable starts on zero, then A updates it (sets it to a non-zero value), then thread B reads it shortly afterwards, thread B could still see the zero value.

  2. There's another problem with accessing shared values from multiple threads (without synchronization) - you can end up trying to use an object that's only been partly initialized (constructing an object is not an atomic process). Multi-threaded reads and writes of 64-bit primitives like longs and doubles are not necessarily atomic either, so if two threads try to read and change the value of a long or a double, one thread can end up seeing something weird and partially set. Or something like that anyway. There are similar problems if you try to use two variables together, like cachedHashCode and isHashCodeCalculated - a thread can easily come along and see the latest version of one of those variables, but an older version of another.

  3. The usual way to get around these multi-threading issues is to use synchronization. For example, you could put all access to the cached hashCode inside a synchronized block, or you could use the volatile keyword (although be careful with that because the semantics are a little confusing).

  4. However, synchronization slows things down. Bad idea for something like a string hashCode. Strings are very often used as keys in HashMaps, so you need the hashCode method to perform well, including in multi-threaded environments.

  5. Java primitives that are 32-bits or less, like int, are special. Unlike, say, a long (64-bit value), you can be sure that you will never read a partially initialized value of an int (32 bits). When you read an int without synchronization, you can't be sure that you'll get the latest set value, but you can be sure that the value you do get is a value that has explicitly been set at some point by your thread or another thread.

The hashCode caching mechanism in java.lang.String is set up to rely on point 5 above. You might understand it better by looking at the source of java.lang.String.hashCode(). Basically, with multiple threads calling hashCode at once, hashCode might end up being calculated multiple times (either if the calculated value is zero or if multiple threads call hashCode at once and both see a zero cached value), but you can be sure that hashCode() will always return the same value. So it's robust, and it's performant too (because there's no synchronization to act as a bottleneck in multi-threaded environments).

Like I said, my understanding of the Java memory model is a little sketchy, but I'm pretty sure I've got the gist of the above right. Ultimately it's a very clever idiom for caching the hashCode without the overhead of synchronization.

Solution 4:

0 isn't cached as the implementation interprets a cached value of 0 as "cached value not yet initialised". The alternative would have been to use a java.lang.Integer, whereby null implied that the value was not yet cached. However, this would have meant an additional storage overhead.

Regarding the probability of a String's hash code being computed as 0 I would say the probability is quite low and can happen in the following cases:

  • The String is empty (although recomputing this hash code each time is effectively O(1)).
  • An overflow occurs whereby the final computed hash code is 0 (e.g. Integer.MAX_VALUE + h(c1) + h(c2) + ... h(cn) == 0).
  • The String contains only Unicode character 0. Very unlikely as this is a control character with no meaning apart from in the "paper tape world" (!):

From Wikipedia:

Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character without meaning otherwise.