Using bts assembly instruction with gcc compiler
Solution 1:
BTS (and the other BT* insns) with a memory destination are slow (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since TEST can macro-fuse into a test-and-branch on more CPUs than AND).
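As a rough illustration of the shift-and-OR / shift-and-AND approach, here is a minimal sketch in plain C. The names set_bit, test_bit and set_bit_example are hypothetical (they are not the question's code), and the bitmap is assumed to be an array of bytes indexed by absolute bit number.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not from the original question: a bitmap stored as
   an array of bytes, indexed by absolute bit number. */
static inline void set_bit(uint8_t *bitmap, size_t bit)
{
    /* Address math: bit / CHAR_BIT picks the byte, bit % CHAR_BIT picks the
       bit inside it. A compiler can emit this as a memory-destination OR. */
    bitmap[bit / CHAR_BIT] |= (uint8_t)(1u << (bit % CHAR_BIT));
}

static inline bool test_bit(const uint8_t *bitmap, size_t bit)
{
    /* Plain C, so the compiler is free to pick TEST (or BT with a register
       destination) and fuse it with a following branch. */
    return (bitmap[bit / CHAR_BIT] & (1u << (bit % CHAR_BIT))) != 0;
}

/* Small usage example: set bit 13 and check its neighbours. */
void set_bit_example(void)
{
    uint8_t map[4] = {0};
    set_bit(map, 13);
    assert(test_bit(map, 13));
    assert(!test_bit(map, 12));
}
```

Because nothing here is inline asm, the compiler can still inline and constant-propagate through both helpers.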
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). Looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits, not CHAR_BIT * sizeof(unsigned long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cdqe instruction, due to the badly-written C which uses 1 instead of 1UL.
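For comparison, a 64bit-clean variant that keeps unsigned long as the element type might look like the sketch below. The names BITS_PER_WORD, set_bit_ul and test_bit_ul are hypothetical; the question's own functions are not reproduced here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits in one array element: 64 on x86-64 Linux, 32 on many 32-bit
   targets. Derived from the type instead of hard-coding 32. */
#define BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))

/* Hypothetical names, chosen for this sketch only. */
static inline void set_bit_ul(unsigned long *array, size_t bit)
{
    /* 1UL keeps the shift in unsigned long, so the mask is never built as
       an int and then sign-extended. */
    array[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

static inline int test_bit_ul(const unsigned long *array, size_t bit)
{
    /* Shift the wanted bit down to position 0 and mask it off: 0 or 1. */
    return (int)((array[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}

/* Small usage example: the top bit of word 0 and the bottom bit of word 1. */
void set_bit_ul_example(void)
{
    unsigned long words[2] = {0};
    set_bit_ul(words, BITS_PER_WORD - 1);
    set_bit_ul(words, BITS_PER_WORD);
    assert(test_bit_ul(words, BITS_PER_WORD - 1) == 1);
    assert(words[1] == 1UL);
}
```

Switching the element type to uint32_t and the divisor to 32 would work just as well; the point is only that the divisor and the shifted constant agree with the element type.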
Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
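For reference, the gcc 6 extension mentioned above is the flag output constraint ("=@ccc" for the carry flag on x86). A minimal sketch of how it could be used with bt on a register operand follows. The helper name test_bit_in_word is hypothetical, and the code falls back to plain C when the compiler or target does not advertise the feature through __GCC_ASM_FLAG_OUTPUTS__.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: test one bit of a register-sized word. */
static inline bool test_bit_in_word(unsigned long word, unsigned long bit)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool carry;
    /* "=@ccc" asks for the carry flag itself as the output, so no SETC is
       needed inside the asm. With register operands, BT takes the bit
       index modulo the operand size. */
    __asm__("bt %2, %1" : "=@ccc"(carry) : "r"(word), "r"(bit));
    return carry;
#else
    /* Portable fallback with the same modulo behaviour. */
    return ((word >> (bit % (CHAR_BIT * sizeof word))) & 1UL) != 0;
#endif
}

/* Small usage example: 0x5 has bits 0 and 2 set. */
void test_bit_in_word_example(void)
{
    assert(test_bit_in_word(0x5UL, 2));
    assert(!test_bit_in_word(0x5UL, 1));
}
```

Even with the flag output, the asm statement stays opaque to the optimizer, so the plain C version is still likely to be the better choice.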
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
Solution 2:
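static inline void SetBit(unsigned long *array, unsigned long bit) {
    /* With a register bit index, bts can touch memory beyond *array,
       so tell the compiler with a "memory" clobber. */
    asm volatile("bts %1,%0" : "+m" (*array) : "r" (bit) : "memory", "cc");
}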