Using Intrinsics to Extract And Shift Odd/Even Bits

Is there a way to optimize the following code using intrinsics? It takes all the odd-indexed bits in a 16-bit integer and shifts them as far right as possible.

I was thinking maybe using the C++ equivalent of ISHFTC from Fortran (is there even a C++ equivalent of this?). But I feel that there is a more efficient way.

int x = some16bitInt;
x = x&0x5555;
int y = 0;
for (int i = 0; i < 8; i++)
    y = y | ((x >> i) & (0x01 << i));

Solution 1:

  • x86: use BMI2 pext if available, except on Zen2 or earlier AMD.

  • Otherwise: @jorgbrown suggested a nice improvement over my bithack.

  • Or if you're doing a lot of this in a loop without fast pext, it's worth considering Jorg's table lookup idea after packing all the bits you want into the low 8 in some order, so the table is only 256 x 1-byte entries.


Fortran ISHFTC is just a rotate. C doesn't directly have this, but you can portably and safely write a function that compilers will pattern-recognize and compile to a single rotate instruction. See Best practices for circular shift (rotate) operations in C++.

I'm not sure that's a useful building block, but it is available.
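
For reference, a minimal sketch of such a UB-free rotate (the usual idiom from that Q&A; the name and width here are just illustrative), which gcc, clang, and MSVC all compile to a single rotate instruction:

#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                   // keep the count in range so neither shift is UB
    return (x << n) | (x >> ((32 - n) & 31));  // the & 31 makes the n == 0 case well-defined
}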


On x86 with BMI2 instruction-set extensions, there's a pext bit-extract instruction, which you could use with a 0x5555 control input. See Intel's docs for _pext_u32 and _u64

It's very fast on Intel Haswell and later (1 uop, 3 cycle latency, 1/clock throughput),
but pretty slow on AMD before Zen 3 (Zen 1/2: 7 uops, 18 cycle latency/throughput). https://agner.org/optimize/ and https://uops.info/. I think that's worse than the shift/mask stuff I've come up with using pure C, especially if latency matters or you're doing this in a loop (not just front-end throughput).

#include <immintrin.h>

// Good on Intel, and AMD Zen3 and later.
unsigned extract_even_bits_bmi2(unsigned a) {
   return _pext_u32(a, 0x5555);
}

With GCC / clang, you have to compile with -mbmi2 (or better, -march=haswell) to enable use of BMI2 intrinsics.
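
If you want a single entry point, here's a sketch of one way to pick at compile time (a compile-time check can't tell you're on Zen 1/2, where pext is available but slow; pack_even_bits16_v2 refers to the portable version defined later in this answer):

#ifdef __BMI2__            // defined by gcc/clang under -mbmi2 or -march=haswell
#include <immintrin.h>
#endif

unsigned pack_even_bits16_v2(unsigned x);   // portable fallback, defined below

unsigned pack_even_bits16(unsigned a)
{
#ifdef __BMI2__
    return _pext_u32(a, 0x5555);    // fast on Intel Haswell+ and AMD Zen 3+
#else
    return pack_even_bits16_v2(a);
#endif
}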


Portable ISO C++

I don't think the usual multiply tricks (to get multiple input bytes shifted and added into the top byte of a result) will work here; you have too many bits and they're too close together. See How to count the number of set bits in a 32-bit integer? for a use-case:
((n & 0x0F0F0F0F) * 0x01010101) >> 24 to horizontally add all the bytes in n.
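
For reference, a minimal sketch of that whole popcount bithack, where the final multiply is the byte-summing step quoted above:

unsigned popcount32(unsigned n)
{
    n = n - ((n >> 1) & 0x55555555);                  // 2-bit sums of pairs of bits
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333);   // 4-bit sums
    n = (n + (n >> 4)) & 0x0F0F0F0F;                  // per-byte sums
    return (n * 0x01010101) >> 24;                    // horizontally add the bytes into the top byte
}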

You could imagine using something like that on your input with * 0x08040201 to align the bits from different bytes differently. But that still leaves major unsolved problems. Perhaps SIMD multiply with 8-bit elements to get pairs of bits shifted together?

But that's not better than moving bits around by masking, shifting, and ORing or ADDing the moved bits with the not-moving bits. With about log2(n_bits) steps, we can get all the bits contiguous.

There are multiple ways to do this; see them on Godbolt. There's room for improvement, like tweaking it to compile better for one ISA vs. another, e.g. helping some ARM compilers see that 0b0000011000000110 is just the other constant right-shifted, so it can and r0, r1, r2, lsr #4 or something.

Or to shift bits to the right instead of left, for ISAs that can't do anything special for left.

unsigned pack_even_bits16_v2(unsigned x)
{
      // ARM / ARM64: repeat these bit-patterns to fill 32 bits,
      // so they fit in an immediate for AND.
      // but that's worse for other RISCs like PowerPC
    x &= 0x5555;        // 0a0b0c0d0e0f0g0h
    x += x<<1;          // aabbccddeeffgghh    // x86 LEA eax, [rdi + rdi*2]
    unsigned move = x &  0b0000011000000110;   // bits to move
    unsigned keep = x &  0b0110000001100000;   // bits to keep
    x = keep + (move << 2);  // 0abcd000 0efgh000

                       // 0abcd000 0efgh000    // with byte boundary shown
    unsigned tmp = x >> 7;  // high group into place, shifting out the low bits
    x &= 0xFF;    // grab the whole low byte ; possibly with a zero-latency movzx
    x = (x>>3) | tmp;
    return x;
}

I'm shifting low bits left instead of shifting high bits right because x86 can left-shift-and-add with one instruction, LEA. On other ISAs it would probably save one shift at the end to move bits to the right.

This compiles pretty nicely for AArch64 and PowerPC64 as well as x86. Clang sees through this bit manipulation for PowerPC and uses the powerful rlwinm (Rotate Left Word Immediate AND Mask) and rlwimi (... Mask Insert) instructions :) At least it did. Unfortunately current clang trunk is now doing two mulli multiply instructions to start with, before rlwinm + 3x rlwimi; the asm below is from when this answer was new.

# clang trunk -O3 for PowerPC64.
# Compiling the  x += x & 0x1111;  version, not the  x += x<<1 version where we get a multiply
        andi. 4, 3, 21845        # x & 0x5555
        andi. 3, 3, 4369         # x & 0x1111
        add 4, 4, 3              # 
        rlwinm 3, 4, 31, 30, 31  # isolate the low 2 bits.  PPC counts bits from MSB=0 LSB=31 for 32-bit registers
        rlwimi 3, 4, 29, 28, 29  # insert the next 2-bit bitfield
        rlwimi 3, 4, 27, 26, 27  # ...
        rlwimi 3, 4, 25, 24, 25
        blr

It would have been better to combine pairs in a tree pattern instead of forming one long serial dependency chain.


Jorg's improved version: move bits by adding to themselves

Masking to select some bits, then adding that back to the original, adds each selected bit to itself: its original position is cleared and a carry goes one position to the left. As long as the next-higher position was already zero, this shifts those bits left by one while leaving the other bits in place.
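
A tiny concrete example of that carry trick (the values here are just illustrative):

    unsigned x = 0b0101;             // 0a0b with a = 1, b = 1
    unsigned y = x + (x & 0b0001);   // 0b0110: bit 0 carried into bit 1, a untouched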

The function below also uses inline asm to work around a GCC/clang missed optimization where they don't just use movzx on x86 to zero-extend a byte. They seem to re-arrange some of the surrounding logic and end up costing more instructions.

unsigned pack_even_bits16_jorg(unsigned x) {
  //      x = ?a?b?c?d ?e?f?g?h
  x  &=     0b01010101'01010101;
  //      x = 0a0b0c0d 0e0f0g0h
  x += (x & 0b00010001'00010001);  // move bits left by adding to themselves
  //      x = 0ab00cd0 0ef00gh0
  x += x << 2;   // the two copies don't overlap, so there are no carries here
  //      x = babcdcde fefghgh0   (plus a in bit 16; the stray high copies get masked off below)
  x >>= 3;
  //      x = 00ababcd cdefefgh
  x  &=     0b00001111'00001111;
  //      x = 0000abcd 0000efgh
  unsigned out;

  #if 0 || !defined(__GNUC__) || !( defined(__x86__)||defined(__x86_64__) )
    out = (unsigned char)x;   // MSVC correctly uses MOVZX here.
  #else  // Work around gcc/clang missed optimization.  TODO: __builtin_constant_p(x) to use pure C for constprop.
    asm("movzb {%b1, %0 | %0, %b1}" : "=r"(out) : "r"(x));  // AT&T | Intel dialect alternatives so it compiles ok with -masm=intel
    // alternatively  shl $4, %ah  ; or %ah, %al   avoids a movzx if you only need the low byte.  But that writes AH, renaming it separately on Intel.
  #endif

  out += x >> 4;
  return out;
}

See it on Godbolt with test code. It compiles better for PowerPC and for x86 / x86-64, and about equally well for AArch64. It would probably be better for AArch64 too if you adjust the AND constant patterns to repeat out to 32 bits so GCC can use them as immediates.
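
A quick exhaustive self-check sketch (assuming the functions above are in scope), comparing both versions against the question's naive loop over every 16-bit input:

#include <assert.h>

int main(void)
{
    for (unsigned v = 0; v < 0x10000; v++) {
        unsigned x = v & 0x5555, expect = 0;
        for (int i = 0; i < 8; i++)
            expect |= (x >> i) & (1u << i);     // the question's reference loop
        assert(pack_even_bits16_v2(v) == expect);
        assert(pack_even_bits16_jorg(v) == expect);
    }
    return 0;
}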


Another way to move bits is to zero the selected bits with XOR, then shift and deposit them somewhere else with a shift and add.

    unsigned tmp = x & mask;
    x += tmp;          // left shift those bits
    x += tmp<<1;       // left shift them again.  (x86 can do this with LEA eax, [rax + rdx*2])

or

    unsigned tmp = x &   0b0000011000000110;   // bits to move
    x ^= tmp;          // clear those bits
    x += tmp << 2;     // LEA eax, [eax + edx*4]  1 fast instruction on x86

When only moving by 2 positions, add + shift-and-add is basically the same length of dependency chain as xor + shift-and-add.

But clearing the old bits with XOR instead of ANDing with the inverted mask is probably worse, at least if the inverted mask fits in an immediate, or if the ISA has an AND-NOT instruction (or, for ARM, a shifted mask operand). ANDing the old x two ways can run in parallel, vs. tmp = x & mask; x ^= tmp serializing execution with a data dependency if it compiled as written. (It doesn't; gcc and clang are smart enough to know what XOR does here and just clear those bits directly.)

Solution 2:

The most flexible bit manipulation on x86 (indeed, on almost any CPU) is an indexed read from memory. It can do completely arbitrary mappings in constant time, typically 1-4 cycles (assuming the memory is cached).

Since you're only talking about 8 bits, and you can easily put the bits you want into the lower 8 bits of a register, albeit in the wrong order, you can just use a lookup table.

unsigned pack_even_bits16_table(unsigned x) {  // x = ?a?b?c?d ?e?f?g?h
  size_t m1 = x & 0x55;         //  m1 = 0e0f0g0h
  size_t m2 = (x >> 7) & 0xAA;  //  m2 = a0b0c0d0
  return map[m1 + m2];          // sum = aebfcgdh
}

where the map is

const unsigned char map[256] = {
    0,   1,   16,  17,  2,   3,   18,  19,  32,  33,  48,  49,  34,  35,  50,  51,
    4,   5,   20,  21,  6,   7,   22,  23,  36,  37,  52,  53,  38,  39,  54,  55,
    64,  65,  80,  81,  66,  67,  82,  83,  96,  97,  112, 113, 98,  99,  114, 115,
    68,  69,  84,  85,  70,  71,  86,  87,  100, 101, 116, 117, 102, 103, 118, 119,
    8,   9,   24,  25,  10,  11,  26,  27,  40,  41,  56,  57,  42,  43,  58,  59,
    12,  13,  28,  29,  14,  15,  30,  31,  44,  45,  60,  61,  46,  47,  62,  63,
    72,  73,  88,  89,  74,  75,  90,  91,  104, 105, 120, 121, 106, 107, 122, 123,
    76,  77,  92,  93,  78,  79,  94,  95,  108, 109, 124, 125, 110, 111, 126, 127,
    128, 129, 144, 145, 130, 131, 146, 147, 160, 161, 176, 177, 162, 163, 178, 179,
    132, 133, 148, 149, 134, 135, 150, 151, 164, 165, 180, 181, 166, 167, 182, 183,
    192, 193, 208, 209, 194, 195, 210, 211, 224, 225, 240, 241, 226, 227, 242, 243,
    196, 197, 212, 213, 198, 199, 214, 215, 228, 229, 244, 245, 230, 231, 246, 247,
    136, 137, 152, 153, 138, 139, 154, 155, 168, 169, 184, 185, 170, 171, 186, 187,
    140, 141, 156, 157, 142, 143, 158, 159, 172, 173, 188, 189, 174, 175, 190, 191,
    200, 201, 216, 217, 202, 203, 218, 219, 232, 233, 248, 249, 234, 235, 250, 251,
    204, 205, 220, 221, 206, 207, 222, 223, 236, 237, 252, 253, 238, 239, 254, 255,
};
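
If you'd rather not hand-maintain the table, here's a sketch of generating the same 256-entry mapping programmatically (gen_map and out_table are illustrative names; index bits are aebfcgdh, output bits are abcdefgh, matching the comments above):

static void gen_map(unsigned char out_table[256])
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned out = 0;
        for (int b = 0; b < 4; b++) {
            out |= ((i >> (2 * b))     & 1) << b;        // even index bits (h,g,f,e) -> low nibble
            out |= ((i >> (2 * b + 1)) & 1) << (b + 4);  // odd index bits  (d,c,b,a) -> high nibble
        }
        out_table[i] = (unsigned char)out;
    }
}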