Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision

Solution 1:

There are lots of examples of the algorithm in practice. For example:

Newton Raphson with SSE2 - can someone explain me these 3 lines has an answer explaining the iteration used by one of Intel's examples.

For perf analysis on let's say Haswell (which can FP mul on two execution ports, unlike previous designs), I'll reproduce the code here (with one op per line). See http://agner.org/optimize/ for tables of instruction throughput and latency, and for docs on how to understand more background.

// execute (aka dispatch) on cycle 1, results ready on cycle 6
nr = _mm_rsqrt_ps( x );

// both of these execute on cycle 6, results ready on cycle 11
xnr = _mm_mul_ps( x, nr );         // dep on nr
half_nr = _mm_mul_ps( half, nr );  // dep on nr

// can execute on cycle 11, result ready on cycle 16
muls = _mm_mul_ps( xnr , nr );     // dep on xnr

// can execute on cycle 16, result ready on cycle 19
three_minus_muls = _mm_sub_ps( three, muls );  // dep on muls

// can execute on cycle 19, result ready on cycle 24
result = _mm_mul_ps( half_nr, three_minus_muls ); // dep on three_minus_muls
// result is an approximation of 1/sqrt(x), with ~22 to 23 bits of precision in the mantissa.

There is a LOT of room for other computation to overlap here, if it's not part of the dependency chain. However, if the data for each iteration of your code is dependent on data from the previous, you might be better off with the 11-cycle latency of sqrtps. Or even if each loop iteration is long enough that out-of-order execution can't hide it all by overlapping independent iterations.

To get sqrt(x) instead of 1/sqrt(x), multiply by x (and fixup if x can be zero, e.g. by masking (_mm_andn_ps) with the result of a CMPPS against 0.0). The optimal way is to replace half_nr with half_xnr = _mm_mul_ps( half, xnr );. That doesn't lengthen the dep chain any, because half_xnr can start on cycle 11 but isn't needed until the end (cycle 19). Same with FMA available: no increase in total latency.

If 11 bits of precision are enough (no Newton iteration), Intel's optimization manual (section 11.12.3) suggests using rcpps(rsqrt(x)), which is worse than multiplying by the original x, at least with AVX. It maybe saves a movdqa instruction with 128-bit SSE, but 256b rcpps is slower than 256b mul or fma. (And it lets you add the sqrt result to something for free with FMA instead of mul for the last step).


The SSE version of this loop, not accounting for any move instructions, is 6 uops. This means it should have a throughput on Haswell of one per 3 cycles (given that two execution ports can handle FP mul, and rsqrt is on the opposite port from FP add/sub). On SnB/IvB (and probably Nehalem), it should have a throughput of one per 5 cycles, since mulps and rsqrtps compete for port 0. subps is on port1, which isn't the bottleneck.

For Haswell, we can use FMA to combine the subtract with the mul. However, the dividers / sqrt unit isn't 256b wide, so unlike everything else, divps / sqrtps / rsqrtps / rcpps on ymm regs takes extra uops and extra cycles.

// vrsqrtps ymm has higher latency
// execute on cycle 1, results ready on cycle 8
nr = _mm256_rsqrt_ps( x );

// both of can execute on cycle 8, results ready on cycle 13
xnr = _mm256_mul_ps( x, nr );         // dep on nr
half_nr = _mm256_mul_ps( half, nr );  // dep on nr

// can execute on cycle 13, result ready on cycle 18
three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3

// can execute on cycle 18, result ready on cycle 23
result = _mm256_mul_ps( half_nr, three_minus_muls ); // dep on three_minus_muls

We save 3 cycles with FMA. This is offset by using 2-cyle-slower 256b rsqrt, for a net gain of 1 cycle less latency (pretty good for twice as wide). SnB/IvB AVX would be the 24c latency for 128b, 26c latency for 256b.

The 256b FMA version uses 7 uops total. (VRSQRTPS is 3 uops, 2 for p0, and one for p1/5.) 256b mulps and fma are both single-uop instructions, and both can run on port 0 or port 1. (p0 only on pre-Haswell). So throughput should be: one per 3c, if the OOO engine dispatches uops to optimal execution ports. (i.e. the shuffle uop from rsqrt always goes to p5, never to p1 where it would take up mul/fma bandwidth.) As far as overlapping with other computation, (not just independent execution of itself), again it's pretty lightweight. Only 7 uops with a 23 cycles dep chain leaves lots of room for other stuff to happen while those uops sit in the re-order buffer.

If this is a step in a giant dep chain with nothing else going on (not even an independent next iteration), then 256b vsqrtps is 19 cycle latency, with a throughput of one per 14 cycles. (Haswell). If you still really need the reciprocal, then 256b vdivps also has 18-21c latency, with one per 14c throughput. So for normal sqrt, the instruction has lower latency. For recip sqrt, an iteration of the approximation is less latency. (And FAR more throughput, if it's overlapping with itself. If overlapping with other stuff that doesn't the divide unit, sqrtps isn't a problem.)

sqrtps can be a throughput win vs. rsqrt + Newton iteration, if it's part of a loop body with enough other non-divide and non-sqrt work going on that the divide unit isn't saturated.

This is especially true if you need sqrt(x), not 1/sqrt(x). e.g. on Haswell with AVX2, a copy+arcsinh loop over an array of floats that fits in L3 cache, implemented as fastlog(v + sqrt(v*v + 1)), runs at about the same throughput with real VSQRTPS or with VRSQRTPS + a Newton-Raphson iteration. (Even with a very fast approximation for log(), so the total loop body is about 9 FMA/add/mul/convert operations, and 2 boolean, plus the VSQRTPS ymm. There is a speedup from using just fastlog(v2_plus_1 * rsqrt(v2_plus_1) + v2_plus_1), so it's not bottlenecked on memory bandwidth, but it might be bottlenecking on latency (so out-of-order execution can't exploit all the parallelism of independent iterations).

Other precisions

For half-precision, there are no instructions to do computation on half floats. You're meant to convert on the fly as you load/store, using the conversion instructions.

For double-precision, there is no rsqrtpd. Presumably, the thinking is that if you don't need full precision, you should be using float in the first place. So you could convert to float and back, then do the exact same algorithm, but with pd instead of ps intrinsics. Or, you could keep your data as float for a while. e.g. convert two ymm registers of doubles into one ymm register of singles.

Unfortunately there isn't a single instruction that takes two ymm registers of doubles and outputs a ymm of singles. You have to go ymm->xmm twice, then _mm256_insertf128_ps one xmm to the high 128 of the other. But then you can feed that 256b ymm vector to the same algorithm.

If you are going to convert back to double right after, it may make sense to do the rsqrt + Newton-Raphson iteration on the two 128b registers of singles separately. The extra 2 uops to insert / extract, and the extra 2 uops for 256b rsqrt, start to add up, not to mention the 3-cycle latency of vinsertf128 / vextractf128. Computation will overlap, and both results will be ready a couple cycles apart. (Or 1 cycle apart, if you have a special version of your code for interleaving operations on 2 inputs at once).

Remember that single-precision has a smaller range of exponents than double, so the conversion can overflow to +Inf or underflow to 0.0. If you're not sure, definitely just use the normal _mm_sqrt_pd.


With AVX512F, there's _mm512_rsqrt14_pd( __m512d a). With AVX512ER (KNL but not SKX or Cannonlake), there's _mm512_rsqrt28_pd. _ps versions of course also exist. 14 bits of mantissa precision may be enough to use without a Newton iteration in more cases.

A Newton iteration still won't give you a correctly rounded single-precision float the way regular sqrt will.