What's wrong with using associativity by compilers?

Sometimes associativity can be used to loose data dependencies and I was curious how much it can help. I was rather surprised to find out that I can nearly get a speed-up factor of 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).

So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with

int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];

which computes something close to String.hashCode() (set M1=31 and use a char[]). The computation is pretty trivial and for t.length=1000 takes about 1.2 microsecond on my i5-2400 @ 3.10GHz (both in Java and C).

Observe that each two steps a gets multiplied by M2 = M1*M1 and added something. This leads to this piece of code

int a = 0;
for (int i=0; i<N; i+=2) {
    a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < len) a = M1 * a + t[i]; // Handle odd length.

This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.

Unlike java, gcc -O3 chooses not to unroll the loop. It's wise choice since it wouldn't help anyway (as -funroll-all-loops shows).

So my question1 is: What prevents such an optimization?

Googling didn't work, I got "associative arrays" and "associative operators" only.

Update

I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.

Update 2

As Java already unrolls the loop, all the hard work is done. What we get is something like

...pre-loop
for (int i=0; i<N; i+=2) {
    a2 = M1 * a + t[i];
    a = M1 * a2 + t[i+1];
}
...post-loop

where the interesting part can be rewritten like

    a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add

This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary school math (working for ints even in case of overflow or whatever, but not applicable to floating point).

    a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add

So far we gained nothing, but it allows us to fold the constants

    a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add

and gain even more by regrouping the sum

    a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add

Here is how I understand your two cases: In the first case, you have a loop that takes N steps; in the second case you manually merged two consecutive iterations of the first case into one, so you only need to do N/2 steps in the second case. Your second case runs faster and you are wondering why the dumb compiler couldn't do it automatically.

There is nothing that would prevent the compiler from doing such an optimization. But please note that this re-write of the original loop leads to larger executable size: You have more instructions inside the for loop and the additional if after the loop.
If N=1 or N=3, the original loop is likely to be faster (less branching and better caching/prefetching/branch prediction). It made things faster in your case but it may make things slower in other cases. It is not a clear cut whether it is worth doing this optimization and it can be highly nontrivial to implement such an optimization in a compiler.

By the way, what you have done is very similar to loop vectorization but in your case, you did the parallel step manually and plugged-in the result. Eric Brumer's Compiler Confidential talk will give you insight why rewriting loops in general is tricky and what drawbacks / disadvantages there are (larger executable size, potentially slower in some cases). So compiler writers are very well aware of this optimization possibility and are actively working on it but it is highly nontrivial in general and can also make things slower.


Please try something for me:

int a = 0;
for (int i=0; i<N; ++i) 
  a = ((a<<5) - a) + t[i];

assuming M1=31. In principle, the compiler should be smart enough to rewrite 31*a into (a<<5)-a but I am curious if it really does that.