Why is pow(int, int) so slow?
I've been working on a few Project Euler exercises to improve my knowledge of C++.
I've written the following code:
int a = 0, b = 0, c = 0;

for (a = 1; a <= SUMTOTAL; a++)
{
    for (b = a+1; b <= SUMTOTAL-a; b++)
    {
        c = SUMTOTAL-(a+b);
        if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
        {
            std::cout << "a: " << a << " b: " << b << " c: " << c << std::endl;
            std::cout << a * b * c << std::endl;
        }
    }
}
This computes in 17 milliseconds.
However, if I change the line
if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)
to
if (c == sqrt((a*a)+(b*b)) && b < c)
the computation takes place in 2 milliseconds. Is there some obvious implementation detail of pow(int, int) that I'm missing which makes the first expression compute so much slower?
pow() works with real floating-point numbers, using the formula pow(x, y) = e^(y * log(x)) under the hood to calculate x^y. (Here log is the natural, base-e logarithm.) The int arguments are converted to double before pow is called. Computing x^2 with pow() is therefore slower than x*x.
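As a rough sketch of that identity (assuming x > 0; this is not the real libm implementation, which works much harder for accuracy and edge cases):

#include <cmath>

// Illustration only: x^y computed as e^(y * ln x), valid for x > 0.
// Real pow() implementations use extended precision internally and
// handle special cases (x <= 0, infinities, NaN) that this ignores.
double naive_pow(double x, double y)
{
    return std::exp(y * std::log(x));
}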
Edit based on relevant comments:
- Using pow even with integer exponents may yield incorrect results (PaulMcKenzie).
- In addition to using a math function with double type, pow is a function call (while x*x isn't) (jtbandes).
- Many modern compilers will in fact optimize out pow with constant integer arguments, but this should not be relied upon.
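A hedged illustration of the first point (the exact behavior depends on the platform's libm; on some implementations the pow call below returns a value just under the exact result, and the truncation silently yields 9999):

#include <cmath>
#include <iostream>

int main()
{
    // pow returns a double; if the result comes back a hair below the
    // mathematically exact value, truncating to int drops to the next
    // integer down.
    int n = static_cast<int>(std::pow(10, 4));   // may be 10000 or 9999
    std::cout << n << '\n';
}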
You've picked one of the slowest possible ways to check
c*c == a*a + b*b // assuming c is non-negative
That compiles to three integer multiplications (one of which can be hoisted out of the loop). Even without pow(), you're still converting to double and taking a square root, which is terrible for throughput. (And also latency, but branch prediction + speculative execution on modern CPUs means that latency isn't a factor here).
Intel Haswell's SQRTSD instruction has a throughput of one per 8-14 cycles (source: Agner Fog's instruction tables), so even if your sqrt() version keeps the FP sqrt execution unit saturated, it's still about 4 times slower than what I got gcc to emit (below).
You can also optimize the loop condition to break out of the loop when the b < c part of the condition becomes false, so the compiler only has to do one version of that check.
#include <iostream>

// SUMTOTAL is assumed to be a compile-time constant, as in the question.
void foo_optimized()
{
    for (int a = 1; a <= SUMTOTAL; a++) {
        for (int b = a+1; b < SUMTOTAL-a-b; b++) {
            // int c = SUMTOTAL-(a+b);  // gcc won't always transform signed-integer math, so this prevents hoisting (SUMTOTAL-a) :(
            int c = (SUMTOTAL-a) - b;
            // if (b >= c) break;       // just changed the loop condition instead

            // the compiler can hoist a*a out of the loop for us
            if (/* b < c && */ c*c == a*a + b*b) {
                // Just print a newline. std::endl also flushes, which bloats the asm
                std::cout << "a: " << a << " b: " << b << " c: " << c << '\n';
                std::cout << a * b * c << '\n';
            }
        }
    }
}
This compiles (with gcc6.2 -O3 -mtune=haswell) to code with this inner loop. See the full code on the Godbolt compiler explorer.
# a*a is hoisted out of the loop. It's in r15d
.L6:
    add     ebp, 1        # b++
    sub     ebx, 1        # c--
    add     r12d, r14d    # ivtmp.36, ivtmp.43  # not sure what this is or why it's in the loop, would have to look again at the asm outside
    cmp     ebp, ebx      # b, _39
    jg      .L13          ## This is the loop-exit branch, not-taken until the end
                          ## .L13 is the rest of the outer loop.
                          ##  It sets up for the next entry to this inner loop.
.L8:
    mov     eax, ebp      # multiply a copy of the counters
    mov     edx, ebx
    imul    eax, ebp      # b*b
    imul    edx, ebx      # c*c
    add     eax, r15d     # a*a + b*b
    cmp     edx, eax      # tmp137, tmp139
    jne     .L6
    ## Fall-through into the cout print code when we find a match
    ## extremely rare, so should predict near-perfectly
On Intel Haswell, all these instructions are 1 uop each. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, and with the 4-wide front-end that can issue at one iteration per 2.5 cycles.
Haswell runs imul r32, r32 with a throughput of one per clock, so the two multiplies inside the inner loop aren't saturating port 1 at two multiplies per 2.5 cycles. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.
We're not even close to any other execution-port bottlenecks, so the front-end bottleneck is the only issue, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.
Loop-unrolling could help here to reduce the number of uops per check, e.g. use lea ecx, [rbx+1] to compute b+1 for the next iteration, so we can imul ebx, ebx without using a MOV to make it non-destructive.
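At the source level, a 2x unroll might look something like the following sketch (hypothetical and unmeasured; this shows the shape of the transformation, not the lea trick itself, and whether it helps depends entirely on what the compiler emits):

// Hypothetical 2x-unrolled variant of foo_optimized (assumes SUMTOTAL as in
// the question; the a*b*c printout is omitted for brevity). Each pass checks
// the pairs (b, c) and (b+1, c-1), keeping the invariant c == (SUMTOTAL-a) - b.
void foo_unrolled()
{
    for (int a = 1; a <= SUMTOTAL; a++) {
        int aa = a * a;                      // hoisted by hand
        int b = a + 1;
        int c = (SUMTOTAL - a) - b;
        for (; b + 2 < c; b += 2, c -= 2) {  // b+2 < c keeps both pairs valid
            if (c * c == aa + b * b)
                std::cout << "a: " << a << " b: " << b << " c: " << c << '\n';
            if ((c - 1) * (c - 1) == aa + (b + 1) * (b + 1))
                std::cout << "a: " << a << " b: " << b + 1 << " c: " << c - 1 << '\n';
        }
        for (; b < c; b++, c--)              // scalar tail: at most one leftover pair
            if (c * c == aa + b * b)
                std::cout << "a: " << a << " b: " << b << " c: " << c << '\n';
    }
}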
A strength-reduction is also possible: given b*b, we could try to compute (b-1) * (b-1) without an IMUL. (b-1) * (b-1) = b*b - 2*b + 1, so maybe we can do an lea ecx, [rbx*2 - 1] and then subtract that from b*b. (There are no addressing modes that subtract instead of add. Hmm, maybe we could keep -b in a register, and count up towards zero, so we could use lea ecx, [rcx + rbx*2 - 1] to update b*b in ECX, given -b in EBX.)
Unless you actually bottleneck on IMUL throughput, this might end up taking more uops and not be a win. It might be fun to see how well a compiler would do with this strength-reduction in the C++ source.
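Here's a hedged sketch of what that strength-reduction might look like in the C++ source (hypothetical and unbenchmarked): it replaces the two per-iteration multiplies with adds by maintaining b*b and c*c incrementally, using (b+1)^2 = b^2 + 2b + 1 and (c-1)^2 = c^2 - 2c + 1.

// Hypothetical strength-reduced variant; assumes SUMTOTAL as in the question.
void foo_strength_reduced()
{
    for (int a = 1; a <= SUMTOTAL; a++) {
        int aa = a * a;
        int b  = a + 1;
        int c  = (SUMTOTAL - a) - b;
        int bb = b * b;   // the only multiplies left run once per outer iteration
        int cc = c * c;
        // The comma expressions update bb/cc from the *old* b and c
        // before stepping them, preserving bb == b*b and cc == c*c.
        for (; b < c; bb += 2*b + 1, b++, cc -= 2*c - 1, c--) {
            if (cc == aa + bb) {
                std::cout << "a: " << a << " b: " << b << " c: " << c << '\n';
                std::cout << a * b * c << '\n';
            }
        }
    }
}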
You could probably also vectorize this with SSE or AVX, checking 4 or 8 consecutive b values in parallel. Since hits are really rare, you just check whether any of the 8 had a hit, and then sort out which one it was in the rare case that there was a match.
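A rough intrinsics sketch of the 4-wide idea (hypothetical: requires SSE4.1 for _mm_mullo_epi32, leaves lane identification to a scalar fallback on the rare hit, and omits the scalar cleanup for the last few b values):

#include <immintrin.h>

// Check 4 consecutive b values per iteration, for a fixed a.
// s = SUMTOTAL - a, so each lane's c is s - b.
void check4(int a, int s)
{
    const __m128i aa   = _mm_set1_epi32(a * a);
    const __m128i svec = _mm_set1_epi32(s);
    const __m128i four = _mm_set1_epi32(4);
    __m128i b = _mm_setr_epi32(a + 1, a + 2, a + 3, a + 4);

    // Run while even the highest lane (b = i+3) still satisfies b < c.
    for (int i = a + 1; (i + 3) < s - (i + 3); i += 4) {
        __m128i c  = _mm_sub_epi32(svec, b);                   // c = s - b
        __m128i bb = _mm_mullo_epi32(b, b);                    // b*b (SSE4.1)
        __m128i cc = _mm_mullo_epi32(c, c);                    // c*c
        __m128i eq = _mm_cmpeq_epi32(cc, _mm_add_epi32(aa, bb));
        if (_mm_movemask_epi8(eq)) {
            // Extremely rare: re-check b values i..i+3 with scalar code
            // to find which lane matched, then print it.
        }
        b = _mm_add_epi32(b, four);
    }
    // (Scalar tail for the remaining b values near b == c omitted.)
}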
See also the x86 tag wiki for more optimization stuff.