Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

Solution 1:

sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.

sqrtss generates a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember correctly) and is still somewhat faster than sqrtss.
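For concreteness, here's a minimal sketch of that kind of sequence (my own illustration with made-up helper names, not the exact code from Intel's manual): take the rsqrtss estimate and refine it with one Newton-Raphson step, y' = y * (1.5 - 0.5*x*y*y).

#include <xmmintrin.h>

/* Sketch: rsqrtss (~11-12 bits) refined by one Newton-Raphson step.
   The result is accurate to roughly 22-23 bits, but not correctly
   rounded the way sqrtss is. */
static inline float rsqrt_nr(float x)
{
    __m128 v     = _mm_set_ss(x);
    __m128 y     = _mm_rsqrt_ss(v);                  /* initial estimate */
    __m128 halfx = _mm_mul_ss(_mm_set_ss(0.5f), v);  /* 0.5 * x          */
    __m128 yy    = _mm_mul_ss(y, y);                 /* y * y            */
    /* y' = y * (1.5 - 0.5*x*y*y) */
    y = _mm_mul_ss(y, _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(halfx, yy)));
    return _mm_cvtss_f32(y);
}

/* The question's trick: sqrt(x) ~= x * rsqrt(x). */
static inline float sqrt_fast(float x)
{
    return x * rsqrt_nr(x);
}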

edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
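A rough packed sketch of the same refinement, applied four lanes at a time (again, a hypothetical helper name; compare against _mm_sqrt_ps when you need the correctly rounded result):

#include <xmmintrin.h>

/* rsqrtps plus one Newton-Raphson step, processing four floats at once. */
static inline __m128 rsqrt_nr_ps(__m128 v)
{
    __m128 y     = _mm_rsqrt_ps(v);                    /* ~11-12 bit estimate */
    __m128 halfv = _mm_mul_ps(_mm_set1_ps(0.5f), v);
    __m128 yy    = _mm_mul_ps(y, y);
    return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(halfv, yy)));
}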

Solution 2:

This is also true for division. MULSS(a,RCPSS(b)) is way faster than DIVSS(a,b). In fact it's still faster even when you increase its precision with a Newton-Raphson iteration.

Intel and AMD both recommend this technique in their optimisation manuals. In applications which don't require IEEE-754 compliance, the only reason to use div/sqrt is code readability.
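For illustration, a hedged sketch of the division version (my own code, not the literal sequence from either manual): approximate 1/b with RCPSS, refine it with one Newton-Raphson step x' = x * (2 - b*x), then multiply by a.

#include <xmmintrin.h>

/* Approximate a/b: RCPSS (~12 bits) plus one Newton-Raphson refinement,
   then a multiply. Faster than DIVSS, but not IEEE-754 compliant. */
static inline float div_fast(float a, float b)
{
    __m128 vb = _mm_set_ss(b);
    __m128 x  = _mm_rcp_ss(vb);                  /* x ~= 1/b */
    /* x' = x * (2 - b*x) */
    x = _mm_mul_ss(x, _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(vb, x)));
    return a * _mm_cvtss_f32(x);
}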

Solution 3:

Instead of supplying an answer that might actually be incorrect (I'm also not going to check or argue about cache and other effects; let's say they are identical), I'll try to point you to a source that can answer your question.
The difference may lie in how sqrt and rsqrt are computed. You can read more at http://www.intel.com/products/processor/manuals/. I'd suggest starting with the descriptions of the instructions you are using; there is some information there, especially about rsqrt (the CPU uses an internal lookup table with a coarse approximation, which makes it much simpler to get the result). It appears that rsqrt is so much faster than sqrt that one additional mul operation (which isn't too costly) doesn't change the situation here.

Edit: A few facts that might be worth mentioning:
1. Once, while doing some micro-optimizations for my graphics library, I used rsqrt to compute the length of vectors (instead of sqrt, I multiplied the sum of squares by its rsqrt, which is exactly what you've done in your tests), and it performed better. A sketch of that pattern follows below.
2. Computing rsqrt with a simple lookup table may be easier because, as x goes to infinity, 1/sqrt(x) goes to 0, so for large x the function's value barely changes, whereas sqrt grows to infinity - so it's not quite that simple a case ;).
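A minimal sketch of that vector-length pattern (hypothetical function name; this uses the raw rsqrtss estimate, so the precision is only ~11-12 bits):

#include <xmmintrin.h>

/* length(x, y, z) via rsqrt: len = dot * rsqrt(dot), avoiding sqrtss. */
static inline float vec3_length_fast(float x, float y, float z)
{
    float dot = x * x + y * y + z * z;
    __m128 r  = _mm_rsqrt_ss(_mm_set_ss(dot));   /* ~ 1/sqrt(dot) */
    return dot * _mm_cvtss_f32(r);
}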

Also, a clarification: I'm not sure where exactly I found it in the manuals I linked, but I'm pretty sure I've read that rsqrt uses a lookup table, and that it should only be used when the result doesn't need to be exact - although I might be wrong about that as well, as it was some time ago :).

Solution 4:

There are a number of other answers to this already from a few years ago. Here's what the consensus got right:

  • The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
  • It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
  • The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
  • There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.

Here's what the consensus got wrong:

  • SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.

The N-R algorithm to compute reciprocal square root has this update step, as others have noted:

x' = 0.5 * x * (3 - n*x*x);

That's a lot of data-dependent multiplications and one subtraction.
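As a quick plain-C illustration of that update step (a toy example of my own, with an arbitrary starting guess), each iteration roughly doubles the number of correct bits:

#include <stdio.h>

/* Newton-Raphson for 1/sqrt(n): x' = 0.5 * x * (3 - n*x*x).
   Note each step is a chain of dependent multiplies plus a subtract. */
int main(void)
{
    float n = 2.0f;
    float x = 0.7f;   /* crude initial guess for 1/sqrt(2) ~= 0.70710678 */
    for (int i = 1; i <= 4; ++i) {
        x = 0.5f * x * (3.0f - n * x * x);
        printf("iteration %d: x = %.9f\n", i, x);
    }
    return 0;
}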

What follows is the algorithm that modern FPUs actually use.

Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:

x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]

Since b[n] = n * y[n]^2 approaches 1, y[n] approaches 1/sqrt(n), and therefore x[n] = n * y[n] approaches sqrt(n).

We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:

b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])

Then:

x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]

and:

y[0] = Y[0]
y[i] = y[i-1] * Y[i]

The next key observation is that b[i] = x[i-1] * y[i-1]. So:

Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
     = 1 + 0.5 * (1 - x[i-1] * y[i-1])

Then:

x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])

That is, given initial x and y, we can use the following update step:

r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r

Or, even fancier, we can set h = 0.5 * y. This is the initialisation:

Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5

And this is the update step:

r = 0.5 - x * h
x' = x + x * r
h' = h + h * r

This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.
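Just to make the derivation concrete, here's a software sketch of the whole thing (hypothetical function name; the hardware obviously doesn't run C, and rsqrtss stands in here for the table-based initial estimate):

#include <xmmintrin.h>

/* Goldschmidt iteration for sqrt(n) and 1/sqrt(n), following the
   derivation above. One or two steps from the ~12-bit estimate are
   enough for single precision. */
static void goldschmidt_sqrt(float n, int steps, float *sqrt_n, float *rsqrt_n)
{
    float Y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(n)));  /* ~ 1/sqrt(n) */
    float x = Y * n;       /* converges to sqrt(n)     */
    float h = Y * 0.5f;    /* converges to 0.5/sqrt(n) */

    for (int i = 0; i < steps; ++i) {
        float r = 0.5f - x * h;   /* residual; goes to 0 as x*h -> 0.5 */
        x = x + x * r;
        h = h + h * r;
    }
    *sqrt_n  = x;
    *rsqrt_n = 2.0f * h;
}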

In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit anyway, otherwise SSE would not be very "streaming". Only one of each circuit was needed to implement this inner loop in a fully pipelined way, without spending a lot of hardware just on square root.

Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
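For reference, a sketch of one update step written as three FMAs in software (using C99's fmaf; I'm not claiming a compiler or the hardware schedules it exactly like this):

#include <math.h>

/* One Goldschmidt step as three fused multiply-adds:
     r  = 0.5 - x*h   ->  fmaf(-x, h, 0.5f)
     x' = x + x*r     ->  fmaf(x, r, x)
     h' = h + h*r     ->  fmaf(h, r, h)
   The last two FMAs are independent of each other and can overlap. */
static void goldschmidt_step_fma(float *x, float *h)
{
    float r  = fmaf(-*x, *h, 0.5f);
    float xn = fmaf(*x, r, *x);
    float hn = fmaf(*h, r, *h);
    *x = xn;
    *h = hn;
}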