What's wrong with using associativity by compilers?
Sometimes associativity can be used to loose data dependencies and I was curious how much it can help. I was rather surprised to find out that I can nearly get a speed-up factor of 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).
So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with
int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];
which computes something close to String.hashCode()
(set M1=31
and use a char[]
). The computation is pretty trivial and for t.length=1000
takes about 1.2 microsecond on my i5-2400 @ 3.10GHz (both in Java and C).
Observe that each two steps a
gets multiplied by M2 = M1*M1
and added something. This leads to this piece of code
int a = 0;
for (int i=0; i<N; i+=2) {
a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < len) a = M1 * a + t[i]; // Handle odd length.
This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.
Unlike java, gcc -O3
chooses not to unroll the loop. It's wise choice since it wouldn't help anyway (as -funroll-all-loops
shows).
So my question1 is: What prevents such an optimization?
Googling didn't work, I got "associative arrays" and "associative operators" only.
Update
I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.
Update 2
As Java already unrolls the loop, all the hard work is done. What we get is something like
...pre-loop
for (int i=0; i<N; i+=2) {
a2 = M1 * a + t[i];
a = M1 * a2 + t[i+1];
}
...post-loop
where the interesting part can be rewritten like
a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add
This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary school math (working for int
s even in case of overflow or whatever, but not applicable to floating point).
a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add
So far we gained nothing, but it allows us to fold the constants
a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add
and gain even more by regrouping the sum
a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add
Here is how I understand your two cases: In the first case, you have a loop that takes N steps; in the second case you manually merged two consecutive iterations of the first case into one, so you only need to do N/2 steps in the second case. Your second case runs faster and you are wondering why the dumb compiler couldn't do it automatically.
There is nothing that would prevent the compiler from doing such an optimization. But please note that this re-write of the original loop leads to larger executable size: You have more instructions inside the for
loop and the additional if
after the loop.
If N=1 or N=3, the original loop is likely to be faster (less branching and better caching/prefetching/branch prediction). It made things faster in your case but it may make things slower in other cases. It is not a clear cut whether it is worth doing this optimization and it can be highly nontrivial to implement such an optimization in a compiler.
By the way, what you have done is very similar to loop vectorization but in your case, you did the parallel step manually and plugged-in the result. Eric Brumer's Compiler Confidential talk will give you insight why rewriting loops in general is tricky and what drawbacks / disadvantages there are (larger executable size, potentially slower in some cases). So compiler writers are very well aware of this optimization possibility and are actively working on it but it is highly nontrivial in general and can also make things slower.
Please try something for me:
int a = 0;
for (int i=0; i<N; ++i)
a = ((a<<5) - a) + t[i];
assuming M1=31
. In principle, the compiler should be smart enough to rewrite 31*a
into (a<<5)-a
but I am curious if it really does that.