Efficient way to OR adjacent bits in 64-bit integer

Solution 1:

Here is a portable C++ implementation. It seems to work during my brief testing. The deinterleave code is based on this SO question.

uint64_t calc(uint64_t n)
{
    // (odd | even)
    uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);

    // deinterleave
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;

    return x;
}

gcc, clang, and msvc all compile this down to about 30 instructions.

A comment suggested a modification to the first line: OR the value with itself shifted right by one, then apply a single mask that keeps only the "odd" bits. This replaces the two masks with one.

The possibly (?) improved code is:

uint64_t calc(uint64_t n)
{
    // (odd | even)
    uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits

    // ... the rest: deinterleave
    x = (x | (x >> 1)) & 0x3333333333333333ull; // bit pairs
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full; // nibbles
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull; // octets
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull; // halfwords
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull; // words

    return x;
}
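
Since this was only briefly tested, one way to gain confidence is to compare it against a bit-by-bit reference. The following is a minimal sketch of such a check, not part of the original answer: the names or_pairs_ref, or_pairs_shift and agree_on_random_inputs are made up for illustration, and the xorshift64 generator is just a convenient source of test inputs.

```cpp
#include <cstdint>

// Reference (hypothetical helper): result bit i = bit 2i OR bit 2i+1 of n.
uint32_t or_pairs_ref(uint64_t n)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        if ((n >> (2 * i)) & 3u)
            r |= uint32_t(1) << i;
    }
    return r;
}

// The shift-and-mask version from this answer, renamed for the comparison.
uint64_t or_pairs_shift(uint64_t n)
{
    uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits
    x = (x | (x >> 1)) & 0x3333333333333333ull;          // bit pairs
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;          // nibbles
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;          // octets
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;          // halfwords
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;         // words
    return x;
}

// Returns true if both versions agree on `rounds` pseudo-random inputs.
bool agree_on_random_inputs(uint64_t seed, int rounds)
{
    uint64_t s = seed ? seed : 1; // xorshift64 state must be nonzero
    for (int i = 0; i < rounds; ++i) {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        if (or_pairs_shift(s) != or_pairs_ref(s))
            return false;
    }
    return true;
}
```

Random inputs will not prove correctness, but they catch mask and shift typos quickly.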

Solution 2:

Probably the fastest solution on x86 with the BMI2 instruction set:

#include <stdint.h>
#include <x86intrin.h>

uint32_t calc (uint64_t a)
{
   return _pext_u64(a, 0x5555555555555555ull) |
          _pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
}

This compiles to 5 instructions total.
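
Because pext only exists on BMI2-capable CPUs, it may be worth guarding it at compile time. The sketch below assumes the GCC/Clang predefined macros __BMI2__ and __x86_64__ (set by options such as -mbmi2); the function name or_pairs is made up, and the fallback is simply the shift-and-mask version from Solution 1. A compiler that does not define these macros takes the portable path.

```cpp
#include <cstdint>

#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h> // _pext_u64
#endif

// Hypothetical wrapper: uses pext when the build targets BMI2,
// otherwise falls back to the portable shift-and-mask sequence.
uint32_t or_pairs(uint64_t a)
{
#if defined(__BMI2__) && defined(__x86_64__)
    // Compact the even-position and odd-position bits separately, then OR.
    return (uint32_t)(_pext_u64(a, 0x5555555555555555ull) |
                      _pext_u64(a, 0xAAAAAAAAAAAAAAAAull));
#else
    uint64_t x = (a | (a >> 1)) & 0x5555555555555555ull;
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return (uint32_t)x;
#endif
}
```

This is a compile-time choice only; a binary built with -mbmi2 will still fault on a CPU without BMI2, so runtime detection would be a separate concern.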

Solution 3:

If you do not have pext and still want to do better than the trivial way, the extraction can be expressed as a logarithmic number of bit moves (logarithmic in the word length, if you generalize it):

// OR adjacent bits, destroys the odd bits but it doesn't matter
x = (x | (x >> 1)) & rep8(0x55);
// gather the even bits with delta swaps
x = bitmove(x, rep8(0x44), 1);   // make pairs
x = bitmove(x, rep8(0x30), 2);   // make nibbles
x = bitmove(x, rep4(0x0F00), 4); // make bytes
x = bitmove(x, rep2(0x00FF0000), 8); // make words
res = (uint32_t)(x | (x >> 16)); // final step is simpler

With:

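uint64_t bitmove(uint64_t x, uint64_t mask, int step) {
    // move the masked bits down by step and clear their old positions,
    // otherwise stale copies get ORed in by the later moves
    return (x & ~mask) | ((x & mask) >> step);
}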

repk is just so I could write shorter constants. rep8(0x44) = 0x4444444444444444 etc.
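
Put together, the steps above can be sketched as one compilable function. The repk helpers are written here as multiplications by a repeating-ones constant, which is an assumption about how they might be defined, and or_pairs_moves is a made-up name. This version of bitmove clears the bits it moves, so that a later move does not OR a stale copy into the result.

```cpp
#include <cstdint>

// Replicate a byte, 16-bit or 32-bit pattern across 64 bits (assumed definitions).
constexpr uint64_t rep8(uint64_t b) { return b * 0x0101010101010101ull; }
constexpr uint64_t rep4(uint64_t h) { return h * 0x0001000100010001ull; }
constexpr uint64_t rep2(uint64_t w) { return w * 0x0000000100000001ull; }

// Move the bits selected by mask down by step, clearing their old positions.
constexpr uint64_t bitmove(uint64_t x, uint64_t mask, int step)
{
    return (x & ~mask) | ((x & mask) >> step);
}

uint32_t or_pairs_moves(uint64_t x)
{
    x = (x | (x >> 1)) & rep8(0x55);      // OR adjacent bits, keep even positions
    x = bitmove(x, rep8(0x44), 1);        // make pairs
    x = bitmove(x, rep8(0x30), 2);        // make nibbles
    x = bitmove(x, rep4(0x0F00), 4);      // make bytes
    x = bitmove(x, rep2(0x00FF0000), 8);  // make 16-bit words
    return (uint32_t)(x | (x >> 16));     // join the two 16-bit halves
}
```

For example, an input of 0xFF has four nonzero pairs in its low byte, so the result is 0xF.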

Also if you do have pext, you can do it with only one of them, which is probably faster and at least shorter:

_pext_u64(x | (x >> 1), rep8(0x55));

Solution 4:

Okay, let's make this more hacky then (might be buggy):

uint64_t x;

uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
uint64_t odd_bits  = x & 0x5555555555555555ull;

Now, my original solution did this:

// incomplete: lines the bits up, but leaves gaps between them
even_bits >>= 1;
uint64_t solution = even_bits | odd_bits;

However, as JackAidley pointed out, while this aligns the bits together, it doesn't remove the spaces from the middle!

Thankfully, we can use a very helpful _pext instruction from the BMI2 instruction set.

u64 _pext_u64(u64 a, u64 m) - Extract bits from a at the corresponding bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.

solution = _pext_u64(solution, 0x5555555555555555ull); // mask is the constant, not the data in odd_bits

Alternatively, instead of using & and >> to separate out the bits, you can use _pext twice on the original number, once with each mask (which splits it into two contiguous 32-bit numbers), and then simply OR the results.

If you don't have access to BMI2, though, I'm pretty sure the gap removal would still involve a loop; a bit simpler one than your original idea, perhaps.
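
To make the _pext description concrete, here is a minimal loop-based stand-in for it, followed by the gap removal built on top. This is only a sketch of the semantics quoted above, not how the hardware instruction is implemented; pext_soft and or_pairs_soft are made-up names.

```cpp
#include <cstdint>

// Hypothetical software stand-in for _pext_u64: for each set bit of mask,
// lowest first, copy the corresponding bit of a into the next low bit of dst.
uint64_t pext_soft(uint64_t a, uint64_t mask)
{
    uint64_t dst = 0;
    for (uint64_t out = 1; mask != 0; out <<= 1) {
        uint64_t lowest = mask & (~mask + 1); // isolate lowest set mask bit
        if (a & lowest)
            dst |= out;
        mask &= mask - 1;                     // clear that mask bit
    }
    return dst;
}

// OR each pair into its even position, then squeeze out the gaps.
uint32_t or_pairs_soft(uint64_t x)
{
    const uint64_t keep = 0x5555555555555555ull;
    uint64_t aligned = (x | (x >> 1)) & keep;
    return (uint32_t)pext_soft(aligned, keep);
}
```

For instance, pext_soft(0xB2, 0xF0) picks out the high nibble and returns 0xB.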

Solution 5:

Slight improvement over the LUT approach (4 lookups instead of 8):

Compute the bitwise OR of adjacent bits and clear every other bit. Then intertwine the bits of pairs of bytes to yield four bytes. Finally, reorder the bits in the four bytes (mapped on the quadword) by means of a 256-entry lookup table:

Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
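
The table contents are not spelled out above, so the following is one possible construction, worked out from the bit layout after the shift by 9: in each even-numbered byte, the odd positions hold the four results of that byte and the even positions hold the four results of the next byte. The names make_lut and or_pairs_lut are made up, and the bytes are extracted with shifts rather than through a byte overlay on the quadword.

```cpp
#include <array>
#include <cstdint>

// Build the 256-entry table: output bits 0..3 come from input bits 1,3,5,7,
// output bits 4..7 come from input bits 0,2,4,6.
static std::array<uint8_t, 256> make_lut()
{
    std::array<uint8_t, 256> lut{};
    for (int b = 0; b < 256; ++b) {
        unsigned out = 0;
        for (int k = 0; k < 4; ++k) {
            out |= ((b >> (2 * k + 1)) & 1u) << k;       // results of this byte
            out |= ((b >> (2 * k)) & 1u) << (k + 4);     // results of the next byte
        }
        lut[b] = static_cast<uint8_t>(out);
    }
    return lut;
}

static const std::array<uint8_t, 256> LUT = make_lut();

uint32_t or_pairs_lut(uint64_t q)
{
    q = (q | (q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs, keep odd positions
    q |= q >> 9;                                // fold each odd byte into the byte below
    return  (uint32_t)LUT[q & 0xFF]
         | ((uint32_t)LUT[(q >> 16) & 0xFF] << 8)
         | ((uint32_t)LUT[(q >> 32) & 0xFF] << 16)
         | ((uint32_t)LUT[(q >> 48) & 0xFF] << 24);
}
```

Only bytes 0, 2, 4 and 6 of the quadword are looked up; the odd-numbered bytes have already been folded into them.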