What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that ((b >> i) & 1) == a[i] holds for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).

What's the fastest way to do this in C?


The fastest way on recent x86 processors is probably to make use of the PMOVMSKB family of instructions, which extract the MSB of each byte of a SIMD register and pack them into a normal integer register.

I fear SIMD intrinsics are not really my thing, but something along these lines ought to work if you've got an AVX2-equipped processor:

#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack(const bool array[32]) {
    /* compare each byte against zero (0xFF for true, 0x00 for false),
       then gather the 32 byte-MSBs into an ordinary integer */
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return (uint32_t) _mm256_movemask_epi8(tmp);
}

This assumes sizeof(bool) == 1. For older SSE2-only systems you will have to string together a pair of 128-bit operations instead, along the lines of the sketch below. Aligning the array on a 32-byte boundary should save another cycle or so.
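
Something like this should work for SSE2 (a rough sketch I haven't benchmarked; bitpack_sse2 is just a name I've picked): compare each 16-byte half against zero and merge the two 16-bit masks.

#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack_sse2(const bool array[32]) {
    /* two 128-bit halves instead of one 256-bit load */
    __m128i lo = _mm_loadu_si128((const __m128i *) array);
    __m128i hi = _mm_loadu_si128((const __m128i *)(array + 16));
    lo = _mm_cmpgt_epi8(lo, _mm_setzero_si128());
    hi = _mm_cmpgt_epi8(hi, _mm_setzero_si128());
    return (uint32_t) _mm_movemask_epi8(lo)
         | ((uint32_t) _mm_movemask_epi8(hi) << 16);
}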


If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplication) using the multiply-and-shift technique discussed here, on a machine with fast multiplication, like this:

#include <stdbool.h>
#include <stdint.h>

static inline int pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);                    // load 8 bools as one 64-bit word
    return (0x8040201008040201ULL*t >> 56) & 0xFF;   // gather their low bits into the top byte
}

int pack32b(bool* a)
{
    return (pack8b(a +  0) << 24) | (pack8b(a +  8) << 16) |
           (pack8b(a + 16) <<  8) | (pack8b(a + 24) <<  0);
}

Explanation:

Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading it, on a little-endian machine we get the bits in reversed order: a[0]'s bit ends up in the lowest byte and a[7]'s bit in the highest. Now we do a multiplication (here dots are zero bits):

  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a       
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 bits have been gathered into the top byte; we just need to shift right by 56 and mask the remaining bits out.

So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code
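
A quick usage sketch, building on the definitions above (the _Alignas anticipates the alignment requirement noted below; with the code as written, a[0] ends up in the most significant bit of the result):

#include <stdio.h>

int main(void)
{
    _Alignas(8) bool a[32] = {0};    /* 8-byte aligned because pack8b loads through a uint64_t* */
    a[0] = a[1] = a[31] = 1;
    printf("%08x\n", (unsigned) pack32b(a));   /* prints c0000001: a[0] -> bit 31, a[31] -> bit 0 */
    return 0;
}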

Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, for example by shifting each product only once instead of shifting right by 56 bits and then left again, as sketched below.
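
One possible unrolled version (a sketch; pack32b_unrolled is a name I've made up, and memcpy is used for the loads, which also removes the alignment requirement): each 64-bit product is shifted straight to its final byte position, so the separate >> 56 in pack8b and << 24/16/8 in pack32b collapse into one shift per group.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

uint32_t pack32b_unrolled(const bool *a)
{
    const uint64_t M = 0x8040201008040201ULL;
    uint64_t t0, t1, t2, t3;
    memcpy(&t0, a +  0, 8);   /* memcpy sidesteps the aliasing/alignment issue */
    memcpy(&t1, a +  8, 8);
    memcpy(&t2, a + 16, 8);
    memcpy(&t3, a + 24, 8);
    return (uint32_t)(((t0 * M) >> 32) & 0xFF000000u)   /* top byte lands at bits 24-31 */
         | (uint32_t)(((t1 * M) >> 40) & 0x00FF0000u)
         | (uint32_t)(((t2 * M) >> 48) & 0x0000FF00u)
         | (uint32_t)(((t3 * M) >> 56) & 0x000000FFu);
}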


Sorry, I overlooked the question: I saw doynax's bool array and misread "32 0/1 values" as 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.

On newer x86 CPUs with BMI2 the PEXT instruction can be used (via the _pext_u64 intrinsic from <immintrin.h>; compile with -mbmi2 on GCC). The pack8b function above can be replaced with

_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);

And to pack 2 of the question's uint32_t values at a time, use

_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
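
Applied to the question's unsigned int a[32], that building block could be used like this (a sketch; pack32u is a name I've made up; it produces the (b >> i) & 1 == a[i] order the question asks for):

#include <immintrin.h>   /* _pext_u64; compile with -mbmi2 */
#include <string.h>

unsigned int pack32u(const unsigned int a[32])
{
    unsigned int b = 0;
    for (int i = 0; i < 32; i += 2) {
        unsigned long long two;
        memcpy(&two, &a[i], sizeof two);   /* a[i] in the low half, a[i+1] in the high half */
        b |= (unsigned int) _pext_u64(two, (1ULL << 32) | 1ULL) << i;
    }
    return b;
}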

Other answers contain an obvious loop implementation.

Here's a first variant:

unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
    result = (result<<1) + a[i];   // note: this ends up with a[0] in the most significant bit

On modern x86 CPUs, I think shifts of any distance in a register take constant time, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 500 single-bit shifts' worth of work (the sum of the shift distances 0 through 31), by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of the differences in the comments; apparently long shifts on x86 are not unit time.)
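
For reference, the "obvious loop" being compared against is something along these lines (it shifts by the loop index, and happens to produce the bit order the question asks for):

unsigned int b = 0;
for (unsigned i = 0; i < 32; ++i)
    b |= a[i] << i;    /* shift distance grows with i: 0, 1, 2, ..., 31 */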

Let us try something more radical.

Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two variables i1 and i2 each containing m such packed bits.

Then the following code packs m*2 booleans into an int:

 (i1 << m) + i2

Using this we can pack 2^n bits as follows:

 unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer

 a2[0]=(a1[0]<<1)+a1[1];  // the original bits are a1[k]; can be scalar variables or ints
 a2[1]=(a1[2]<<1)+a1[3];  //  yes, you can use "|" instead of "+"
 ...
 a2[15]=(a1[30]<<1)+a1[31];

 a4[0]=(a2[0]<<2)+a2[1];
 a4[1]=(a2[2]<<2)+a2[3];
 ...
 a4[7]=(a2[14]<<2)+a2[15];

 a8[0]=(a4[0]<<4)+a4[1];
 a8[1]=(a4[2]<<4)+a4[3];
 a8[2]=(a4[4]<<4)+a4[5];
 a8[3]=(a4[6]<<4)+a4[7];

 a16[0]=(a8[0]<<8)+a8[1];
 a16[1]=(a8[2]<<8)+a8[3];

 a32[0]=(a16[0]<<16)+a16[1];

Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 62 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits.)

On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; in effect it does the equivalent of 80 1-bit shifts (16 bit positions' worth of shifting at each of the 5 levels).

On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect the compiler to schedule all the remaining scalars into registers, giving 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid the 32 increment/compare/index operations.

If the starting booleans are really in an array, with each bit occupying the bottom bit of an otherwise-zeroed byte:

bool a1[32];

then we can abuse our knowledge of memory layout to fetch several at a time:

a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one 4-byte fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];

a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];

a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];

a32[0]=(a16[0]<<4)+a16[1];

Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
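
The same ladder wrapped up as a self-contained function might look like this (a sketch; pack32_layout is my name, memcpy is used for the 4-byte fetches to stay clear of aliasing/alignment trouble, and the tiny fixed-count loops will be fully unrolled by any optimizing compiler). Note the bits come out in the fixed, interleaved order this layout trick produces, not in the question's (b >> i) & 1 order.

#include <stdbool.h>
#include <string.h>

unsigned int pack32_layout(const bool a1[32])
{
    unsigned int a4[8], a8[4], a16[2];
    for (int k = 0; k < 8; ++k)                 /* 8 fetches of 4 bools each */
        memcpy(&a4[k], a1 + 4*k, 4);
    for (int k = 0; k < 4; ++k)                 /* 4 shifts, 4 adds */
        a8[k] = (a4[2*k] << 1) + a4[2*k + 1];
    a16[0] = (a8[0] << 2) + a8[1];              /* 2 shifts, 2 adds */
    a16[1] = (a8[2] << 2) + a8[3];
    return (a16[0] << 4) + a16[1];              /* 1 shift, 1 add */
}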

To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).

As always, these solutions need to be performance-tested.