Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2
I am looking for an optimal method to calculate sum of all packed 32-bit integers in a __m256i
or __m512i
. To calculate sum of n elements, I ofter use log2(n) vpaddd
and vpermd
function, then extract the final result. Howerver, it is not the best option I think.
Edit: best/optimal in term of speed/cycle reduction.
Solution 1:
Related: if you're looking for the non-existant _mm512_reduce_add_epu8
, see Summing 8-bit integers in __m512i with AVX intrinsics vpsadbw
as an hsum within qwords is much more efficient than shuffling.
Without AVX512, see hsum_8x32(__m256i)
below for AVX2 without Intel's reduce_add
helper function. reduce_add
doesn't necessarily compile optimally anyway with AVX512.
There is a int _mm512_reduce_add_epi32(__m512i)
inline function in immintrin.h
. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd
, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.
GCC 9.2 -O3 -march=skylake-avx512
compiles a wrapper that calls it as follows:
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm1, ymm1, ymm0
vextracti64x2 xmm0, ymm1, 0x1 # silly compiler, vextracti128 would be shorter
vpaddd xmm1, xmm0, xmm1
vpshufd xmm0, xmm1, 78
vpaddd xmm0, xmm0, xmm1
vmovd edx, xmm0
vpextrd eax, xmm0, 1 # 2x xmm->integer to feed scalar add.
add eax, edx
ret
Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd
.
Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd
. See below for perf analysis of the two.
There is a VPHADDD
but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd
with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd
is still only in-lane.)
Yes, you need log2(vector_width)
shuffles and vpaddd
instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).
General strategy for all SSE / AVX / AVX512
You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i
until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.
Your shuffles can take immediate control operands, not vectors for vpermd
. e.g. VEXTRACTI32x8
, vextracti128
, and vpshufd
. (Or vpunpckhqdq
to save code size for the immediate constant.)
See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).
This general strategy is appropriate for all element types: float, double, and any size integer
Special cases:
-
8-bit integer: start with
vpsadbw
, more efficient and avoids overflow, but then continue as for 64-bit integers. -
16-bit integer: start by widening to 32 with
pmaddwd
(_mm256_madd_epi16
with set1_epi16(1)) : SIMD: Accumulate Adjacent Pairs - fewer uops even if you don't care about the avoiding-overflow benefit, except on AMD before Zen2 where 256-bit instructions cost at least 2 uops. But then you continue as for 32-bit integer.
32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i
, in turn called by the AVX512 function after reducing to __m256i
. The calls will of course inline in practice.
#include <immintrin.h>
#include <stdint.h>
// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x); // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // Swap the low two elements
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32); // movd
}
// only needs AVX2
uint32_t hsum_8x32(__m256i v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AXV512VL instruction if AVX512 is enabled :/
return hsum_epi32_avx(sum128);
}
// AVX512
uint32_t hsum_16x32(__m512i v)
{
__m256i sum256 = _mm256_add_epi32(
_mm512_castsi512_si256(v), // low half
_mm512_extracti64x4_epi64(v, 1)); // high half. AVX512F. 32x8 version is AVX512DQ
return hsum_8x32(sum256);
}
Notice that this uses __m256i
hsum as a building block for __m512i
; there's nothing to be gained by doing in-lane operations first.
Well possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing, so they could execute 2 cycles earlier and leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple instructions later even if you did that. So you might get a handful of some independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).
But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.
Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512
hsum_16x32(long long __vector(8)):
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm0, ymm1, ymm0
vextracti64x2 xmm1, ymm0, 0x1 # silly compiler uses a longer EVEX instruction when its available (AVX512VL)
vpaddd xmm0, xmm0, xmm1
vpunpckhqdq xmm1, xmm0, xmm0
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 177
vpaddd xmm0, xmm1, xmm0
vmovd eax, xmm0
ret
P.S.: perf analysis of GCC's _mm512_reduce_add_epi32
vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:
After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123]
or something.
But other than that it seems almost always worse than the the shuffle / vpadd / vmovd at the end of my implementation, on Skylake-X:
- total uops: reduce: 4. Mine: 3
- ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar
add
) - ports: mine: p5, p015 (
vpadd
on SKX), p0 (vmod
)
Latency is equal at 4 cycles, assuming no resource conflicts:
- shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
- vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.