Near constant time rotate that does not violate the standards
Solution 1:
I've linked to this answer for the full details from several other "rotate" questions, including this community wiki question, which should be kept up to date with best-practices.
I found a blog post about this issue, and it looks like it's finally a solved problem (with new-enough compiler versions).
John Regehr at the University of Utah recommends version "c" of his attempts at making a rotate function. I replaced his assert with a bitwise AND, and found that it still compiles to a single rotate insn.
#include <stdint.h>
#include <limits.h>  // CHAR_BIT
#include <assert.h>

typedef uint32_t rotwidth_t; // parameterize for comparing compiler output with various sizes

rotwidth_t rotl (rotwidth_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1); // e.g. 31
  assert ( (n<=mask) && "rotate by type width or more");
  n &= mask;  // avoid undef behaviour with NDEBUG. 0 overhead for most types / compilers
  return (x<<n) | (x>>( (-n)&mask ));
}

rotwidth_t rot_const(rotwidth_t x)
{
  return rotl(x, 7);
}
This could be templated on x's type, but it probably makes more sense for real use to have the width in the function name (like rotl32). Usually when you're rotating, you know what width you want, and that matters more than what size variable you're currently storing the value in.
Also make sure to only use this with unsigned types. Right-shift of signed types does an arithmetic shift, shifting in sign-bits. (It's technically implementation-dependent behaviour, but everything uses 2's complement now.)
Pabigot independently came up with the same idea before I did, and posted it at github. His version has C++ static_assert checking to make it a compile-time error to use a rotate count outside the range for the type.
I tested mine with gcc.godbolt.org, with NDEBUG defined, for variable and compile-time-const rotate counts:
- gcc: optimal code with gcc >= 4.9.0, non-branching neg+shifts+or with earlier. (compile-time const count: gcc 4.4.7 is fine)
- clang: optimal code with clang >= 3.5.0, non-branching neg+shifts+or with earlier. (compile-time const rotate count: clang 3.0 is fine)
- icc 13: optimal code. (compile-time const count with -march=native: generates slower shld $7, %edi, %edi. Fine without -march=native)
Even newer compiler versions can handle the commonly-given code from wikipedia (included in the godbolt sample) without generating a branch or cmov. John Regehr's version has the advantage of avoiding undefined behaviour when the rotate count is 0.
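For reference, this is the commonly-given wikipedia-style idiom being referred to (the name rotl_naive is mine, for illustration); shifting by 32-n is a shift by the full type width when n == 0, which is undefined behaviour in C and C++:

```cpp
#include <cstdint>

// Naive rotate: undefined behaviour when n == 0, because x >> 32 is an
// out-of-range shift for a 32-bit type. Compilers may or may not do what
// you hoped; John Regehr's masked version avoids this.
uint32_t rotl_naive(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));  // UB for n == 0
}
```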
There are some caveats with 8 and 16 bit rotates, but compilers seem fine with 32 or 64, when n is uint32_t. See the comments in the code on the godbolt link for some notes from my testing various widths of uint*_t. Hopefully this idiom will be better-recognized by all compilers for more combinations of type widths in the future. Sometimes gcc will uselessly emit an AND insn on the rotate count, even though the x86 ISA defines the rotate insns with that exact AND as the first step.
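As an illustration of the narrow-type case, the same masking idiom specialized to 8 bits (a sketch; rotl8 is my name, and whether this compiles to a single byte-rotate insn varies by compiler version):

```cpp
#include <cstdint>
#include <climits>

// Same pattern as rotl(), specialized to 8 bits. The &7 masks keep every
// shift count in range, so there is no UB for any n, but gcc/clang are
// less reliable at recognizing the pattern for 8- and 16-bit types.
static inline uint8_t rotl8(uint8_t x, unsigned n)
{
    const unsigned mask = CHAR_BIT * sizeof(x) - 1;  // 7
    n &= mask;
    return (uint8_t)((x << n) | (x >> (-n & mask)));
}
```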
"optimal" means as efficient as:
# gcc 4.9.2 rotl(unsigned int, unsigned int):
movl %edi, %eax
movl %esi, %ecx
roll %cl, %eax
ret
# rot_const(unsigned int):
movl %edi, %eax
roll $7, %eax
ret
When inlined, the compiler should be able to arrange for values to be in the right registers in the first place, resulting in just a single rotate.
With older compilers, you'll still get ideal code when the rotate count is a compile-time constant. Godbolt lets you test with ARM as a target, and it used a rotate there, too. With variable counts on older compilers, you get a bit of code bloat, but no branches or major performance problems, so this idiom should be safe to use in general.
BTW, I modified John Regehr's original to use CHAR_BIT*sizeof(x), and gcc / clang / icc emit optimal code for uint64_t as well. However, I did notice that changing x to uint64_t while the function return type is still uint32_t makes gcc compile it to shifts/or. So be careful to cast the result to 32 bits in a separate sequence point, if you want the low 32b of a 64b rotate. i.e. Assign the result to a 64-bit variable, then cast/return it. icc still generates a rotate insn, but gcc and clang don't, for
// generates slow code on gcc/clang: do the cast separately instead
uint32_t r = (uint32_t)( (x<<n) | (x>>( -n&(CHAR_BIT*sizeof(x)-1) )) );
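Sketching the safe pattern described above (the helper name is mine): do the rotate at full width, store it in a 64-bit variable, and only truncate afterwards:

```cpp
#include <cstdint>
#include <climits>

// Low 32 bits of a 64-bit rotate: rotate at full width first, then
// truncate in a separate step. Casting inside the same expression is
// what defeats gcc/clang's rotate recognition.
static inline uint32_t low32_of_rotl64(uint64_t x, unsigned n)
{
    const unsigned mask = CHAR_BIT * sizeof(x) - 1;      // 63
    n &= mask;
    uint64_t rotated = (x << n) | (x >> (-n & mask));    // full 64-bit rotate
    return (uint32_t)rotated;                            // truncate afterwards
}
```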
If anyone can test this with MSVC, it would be useful to know what happens there.
Solution 2:
You can add one additional modulo operation to prevent the shifting by 32 bits, but I'm not convinced this is faster than using an if check in conjunction with branch predictors.
template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>((sizeof(T)*8-y) % (sizeof(T)*8))));
}
Solution 3:
Writing the expression as T((x<<y) | ((x>>(sizeof(T)*CHAR_BIT-y-1))>>1)) should yield defined behavior for all values of y below the bit size, assuming that T is an unsigned type with no padding. Unless the compiler has a good optimizer, the resulting code may not be as good as what would have been produced by your original expression. Having to put up with clunky, hard-to-read code which will yield slower execution on many compilers is part of the price of progress, however, since a hyper-modern compiler which is given
if (y) do_something();
return T((x<<y) | (x>>(sizeof(T)*8-y)));
might improve the "efficiency" of the code by making the call to do_something unconditional.
PS: I wonder if there are any real-world platforms where redefining right-shift, so that x >> y when y is precisely equal to the bit size of x were required to yield either 0 or x (with the choice made in an arbitrary, unspecified fashion), would force the platform to generate extra code, or would preclude genuinely useful optimizations in non-contrived scenarios?