How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
The intrinsic
int mask = _mm256_movemask_epi8(__m256i s1)
creates a mask whose 32 bits correspond to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2, for example), I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector in which the most significant bit of each byte contains the corresponding bit of the uint32_t mask.
What is the best way to do this?
Edit: I need the inverse because the intrinsic _mm256_blendv_epi8 accepts only a __m256i mask, not a uint32_t. In the resulting __m256i mask, all bits other than the MSB of each byte can be ignored.
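For reference, the semantics in both directions can be written as a plain scalar sketch (no intrinsics; element 0 is the least-significant byte). The function names here are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of _mm256_movemask_epi8: collect the MSB of each byte
// into the corresponding bit of a 32-bit integer.
uint32_t movemask_scalar(const uint8_t bytes[32]) {
    uint32_t mask = 0;
    for (int i = 0; i < 32; ++i)
        mask |= (uint32_t)(bytes[i] >> 7) << i;
    return mask;
}

// Scalar model of the desired inverse: set the MSB of byte i from bit i
// of the mask. (VPBLENDVB only reads the MSB, so the low 7 bits of each
// byte are don't-cares.)
void inverse_movemask_scalar(uint32_t mask, uint8_t bytes[32]) {
    for (int i = 0; i < 32; ++i)
        bytes[i] = ((mask >> i) & 1) ? 0x80 : 0x00;
}
```

The round trip holds by construction: applying movemask_scalar to the output of inverse_movemask_scalar returns the original mask.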
Solution 1:
I have implemented the three approaches below on a Haswell machine. Evgeny Kluev's approach is the fastest (1.07 s), followed by Jason R's (1.97 s) and Paul R's (2.44 s). The code below was compiled with the -march=core-avx2 -O3 optimization flags.
#include <immintrin.h>
#include <cstdint>
#include <iostream>
#include <boost/date_time/posix_time/posix_time.hpp>

// t_icc = 1.07 s
// t_g++ = 1.09 s
__m256i get_mask3(const uint32_t mask) {
    __m256i vmask(_mm256_set1_epi32(mask));
    const __m256i shuffle(_mm256_setr_epi64x(0x0000000000000000,
        0x0101010101010101, 0x0202020202020202, 0x0303030303030303));
    vmask = _mm256_shuffle_epi8(vmask, shuffle);
    const __m256i bit_mask(_mm256_set1_epi64x(0x7fbfdfeff7fbfdfe));
    vmask = _mm256_or_si256(vmask, bit_mask);
    return _mm256_cmpeq_epi8(vmask, _mm256_set1_epi64x(-1));
}

// t_icc = 1.97 s
// t_g++ = 1.97 s
__m256i get_mask2(const uint32_t mask) {
    __m256i vmask(_mm256_set1_epi32(mask));
    const __m256i shift(_mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0));
    vmask = _mm256_sllv_epi32(vmask, shift);
    const __m256i shuffle(_mm256_setr_epi64x(0x0105090d0004080c,
        0x03070b0f02060a0e, 0x0105090d0004080c, 0x03070b0f02060a0e));
    vmask = _mm256_shuffle_epi8(vmask, shuffle);
    const __m256i perm(_mm256_setr_epi64x(0x0000000000000004, 0x0000000100000005,
        0x0000000200000006, 0x0000000300000007));
    return _mm256_permutevar8x32_epi32(vmask, perm);
}

// t_icc = 2.44 s
// t_g++ = 2.45 s
__m256i get_mask1(uint32_t mask) {
    const uint64_t pmask = 0x8080808080808080ULL; // bit-unpacking mask for PDEP
    uint64_t amask0, amask1, amask2, amask3;
    amask0 = _pdep_u64(mask, pmask);
    mask >>= 8;
    amask1 = _pdep_u64(mask, pmask);
    mask >>= 8;
    amask2 = _pdep_u64(mask, pmask);
    mask >>= 8;
    amask3 = _pdep_u64(mask, pmask);
    return _mm256_set_epi64x(amask3, amask2, amask1, amask0);
}

int main() {
    __m256i mask = _mm256_setzero_si256(); // must be initialized before xor-accumulating
    boost::posix_time::ptime start(
        boost::posix_time::microsec_clock::universal_time());
    for (unsigned i(0); i != 1000000000; ++i) {
        mask = _mm256_xor_si256(mask, get_mask3(i));
    }
    boost::posix_time::ptime end(
        boost::posix_time::microsec_clock::universal_time());
    std::cout << "duration:" << (end - start)
              << " mask:" << _mm256_movemask_epi8(mask) << std::endl;
    return 0;
}
Solution 2:
Here is an alternative to LUT or pdep instructions that might be more efficient:

1. Copy your 32-bit mask to both the low four bytes of some ymm register and bytes 16..19 of the same register. You could use a temporary array and _mm256_load_si256. Or you could move a single copy of the 32-bit mask to the low bytes of some ymm register, then broadcast it with VPBROADCASTD (_mm256_broadcastd_epi32) or other broadcast/shuffle instructions.
2. Rearrange the bytes of the register so that the low 8 bytes (each) contain the low 8 bits of your mask, the next 8 bytes the next 8 bits, etc. This could be done with VPSHUFB (_mm256_shuffle_epi8) with a control register containing 0 in the low 8 bytes, 1 in the next 8 bytes, etc.
3. Select the proper bit for each byte with VPOR (_mm256_or_si256) or VPAND (_mm256_and_si256).
4. Set the MSB of the appropriate bytes with VPCMPEQB (_mm256_cmpeq_epi8), comparing each byte to 0xFF. If you want each bit of the mask toggled, use VPAND in the previous step and compare to zero.

An additional flexibility of this approach is that you can choose a different control register for step 2 and a different mask for step 3 to shuffle the bits of your bit mask (for example, you could copy the mask to the ymm register in reversed order).
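The OR-then-compare trick in steps 3 and 4 relies on a constant that clears exactly one bit position per byte. As a scalar sanity check (no intrinsics), the 64-bit constant that appears in get_mask3 of Solution 1 can be derived like this:

```cpp
#include <cassert>
#include <cstdint>

// Build the per-byte selector for step 3: byte k has every bit set
// except bit (k % 8). After OR-ing with the broadcast mask byte, a
// byte is all-ones exactly when the selected mask bit was set, so
// VPCMPEQB against all-ones yields the desired byte mask.
uint64_t byte_select_constant() {
    uint64_t c = 0;
    for (int k = 0; k < 8; ++k) {
        uint64_t byte = 0xFFu & ~(1u << k); // all bits set except bit k
        c |= byte << (8 * k);
    }
    return c;
}
```

The result is 0x7fbfdfeff7fbfdfe, matching the bit_mask literal in get_mask3.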
Solution 3:
My initial approach to this was similar to @Jason R's, because that is how "normal" operations work, but most of these operations only care about the high bit and ignore all the others. Once I realized this, the _mm*_maskz_broadcast*_epi*(mask, __m128i) family of functions made the most sense. You will need to enable -mavx512vl and -mavx512bw (gcc).
To get a vector with the highest bit of each byte set according to a mask:
/* convert 16 bit mask to __m128i control byte mask */
_mm_maskz_broadcastb_epi8((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m256i control byte mask */
_mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0))
/* convert 64 bit mask to __m512i control byte mask */
_mm512_maskz_broadcastb_epi8((__mmask64)mask,_mm_set1_epi32(~0))
To get a vector with the highest bit of each word set according to a mask:
/* convert 8 bit mask to __m128i control word mask */
_mm_maskz_broadcastw_epi16((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m256i control word mask */
_mm256_maskz_broadcastw_epi16((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m512i control word mask */
_mm512_maskz_broadcastw_epi16((__mmask32)mask,_mm_set1_epi32(~0))
To get a vector with the highest bit of each double word set according to a mask:
/* convert 8 bit mask to __m256i control mask */
_mm256_maskz_broadcastd_epi32((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m512i control mask */
_mm512_maskz_broadcastd_epi32((__mmask16)mask,_mm_set1_epi32(~0))
To get a vector with the highest bit of each quad word set according to a mask:
/* convert 8 bit mask to __m512i control mask */
_mm512_maskz_broadcastq_epi64((__mmask8)mask,_mm_set1_epi32(~0))
The one specific to this question is: _mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0))
but I include the others for reference/comparison.
Note that each byte/word/... will either be all ones or all zeroes according to the mask (not just the highest bit). This can also be useful for doing vectorized bit operations (&'ing with another vector for instance to zero out unwanted bytes/words).
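A plain scalar sketch (no AVX-512 required) of what the byte-granularity variant produces, together with the zeroing use mentioned above; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of _mm256_maskz_broadcastb_epi8(mask, all-ones):
// byte i of the result is 0xFF when mask bit i is set, else 0x00.
// Since every result byte is all-ones or all-zeros, it works both as
// a blendv control and as an AND mask.
void mask_to_bytes(uint32_t mask, uint8_t out[32]) {
    for (int i = 0; i < 32; ++i)
        out[i] = ((mask >> i) & 1) ? 0xFF : 0x00;
}

// Example of the vectorized-bit-operation use: AND with another
// vector to zero out the unwanted bytes.
void zero_unwanted(uint32_t mask, const uint8_t in[32], uint8_t out[32]) {
    uint8_t m[32];
    mask_to_bytes(mask, m);
    for (int i = 0; i < 32; ++i)
        out[i] = in[i] & m[i];
}
```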
Another note: each _mm_set1_epi32(~0) could/should be converted to a constant (either manually or by the compiler), so it should compile down to one fairly quick operation, though it may be slightly faster in testing than in real life since the constant will likely stay in a register. With the all-ones source operand, these broadcasts compile down to single VPMOVM2{B,W,D,Q} instructions.
Edit: In case your compiler doesn't support AVX512, the inline assembly version should look like:
inline __m256i dmask2epi8(__mmask32 mask) {
    __m256i ret;
    __asm("vpmovm2b %1, %0" : "=x"(ret) : "k"(mask) :);
    return ret;
}
The other instructions are similar.
Solution 4:
Here's another implementation that might work on AVX2 since you had that tag on your question (it is untested since I don't have a Haswell machine). It is similar to Evgeny Kluev's answer, but it might take fewer instructions. It requires two constant __m256i
masks, though. If you're doing this many times in a loop, then the overhead of setting up those constants once ahead of time may be negligible.
1. Take your 32-bit mask and broadcast it to all 8 slots of a ymm register using _mm256_set1_epi32() (which compiles to VPBROADCASTD on AVX2).
2. Create a __m256i holding 8 32-bit integers with the values [0, 1, 2, 3, 4, 5, 6, 7] (from the least-significant to the most-significant element).
3. Use that constant to shift each of the 32-bit integers in your ymm register left by a different amount, using _mm256_sllv_epi32().
4. Now, if we view the ymm register as holding 8-bit integers and look at their MSBs, the register holds the MSBs for byte indices [7, 15, 23, 31, 6, 14, 22, 30, 5, 13, 21, 29, 4, 12, 20, 28, 3, 11, 19, 27, 2, 10, 18, 26, 1, 9, 17, 25, 0, 8, 16, 24] (from the least-significant to the most-significant element).
5. Use a bitwise AND against a constant mask of [0x80, 0x80, 0x80, ...] to isolate the MSB of each byte.
6. Use a sequence of shuffles and/or permutes to get the elements back into the order that you want. Unfortunately, there is no any-to-any permute for 8-bit integers like there is for floating-point values in AVX2.