Can long integer routines benefit from SSE?

In the past, the answer to this question was a solid, "no". But as of 2017, the situation is changing.

But before I continue, time for some background terminology:

Full Word Arithmetic
Partial Word Arithmetic

Full-Word Arithmetic:

This is the standard representation where the number is stored in base 2³² or 2⁶⁴ using an array of 32-bit or 64-bit integers. Many bignum libraries and applications (including GMP) use this representation.

In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But stuff like addition are more difficult because of the need for carry-propagation.

It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.

Partial-Word Arithmetic

This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.

The authors of GMP call this, "nails" where the "nail" is the unused portion of the word.

In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.

Problems with Full-Word Arithmetic:

Vectorizing full-word arithmetic has historically been a lost cause:

SSE/AVX2 has no support for carry-propagation.
SSE/AVX2 has no 128-bit add/sub.
SSE/AVX2 has no 64 x 64-bit integer multiply.*

*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.

Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:

Add-with-Carry: adc, adcx, adox.
Double-word Multiply: Single-operand mul and mulx.

In light of this, both bignum-add and bignum-multiply are difficult for SIMD to beat scalar on x64. Definitely not with SSE or AVX.

With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.

AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.

But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".

Vectorization of Partial-Word Arithmetic

With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.

So as long as you as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.

Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.

Carryout on horizontally-vectorized bignums is still cheap as you merely shift the nails over the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to do a comparison of two numbers that are almost the same.

Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.

AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.

Large Multiplication using FFTs

For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).

FFTs are completely vectorizable since they work on independent doubles. This is possible because fundamentally, the representation that FFTs use is a partial word representation.

To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.

If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.

But nevertheless, bignum arithmetic is possible to vectorize.

Disclosure:

I'm the author of y-cruncher which does plenty of large number arithmetic.

Can long integer routines benefit from SSE?

Related

Recent Posts