Getting the high part of 64-bit integer multiplication

In C++, suppose I have:

uint64_t i;
uint64_t j;

then i * j will yield a uint64_t whose value is the lower part of the multiplication of i and j, i.e., (i * j) mod 2^64. Now, what if I wanted the higher part of the multiplication? I know that an assembly instruction exists to do something like that when using 32-bit integers, but I am not familiar at all with assembly, so I was hoping for help.

What is the most efficient way to implement something like:

uint64_t k = mulhi(i, j);

If you're using gcc and your version supports 128-bit numbers (try __uint128_t), then performing the 128-bit multiply and extracting the upper 64 bits is likely to be the most efficient way of getting the result.
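
A minimal sketch of that approach (GNU C; mulhi here is just the name from the question):

#include <stdint.h>

// High 64 bits of the full 128-bit product, using the GNU C extension type.
uint64_t mulhi(uint64_t i, uint64_t j)
{
    return (uint64_t)(((__uint128_t)i * j) >> 64);
}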

If your compiler doesn't support 128-bit numbers, then Yakk's answer is correct. However, it may be too brief for general consumption. In particular, an actual implementation has to be careful not to overflow 64-bit integers.

The simple and portable solution he proposes is to break each of a and b into two 32-bit numbers and then multiply those 32-bit numbers using the 64-bit multiply operation. If we write:

uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;

then it is obvious that:

a = (a_hi << 32) + a_lo;
b = (b_hi << 32) + b_lo;

and:

a * b = ((a_hi << 32) + a_lo) * ((b_hi << 32) + b_lo)
      = ((a_hi * b_hi) << 64) +
        ((a_hi * b_lo) << 32) +
        ((b_hi * a_lo) << 32) +
          a_lo * b_lo

provided the calculation is performed using 128-bit (or greater) arithmetic.

But this problem requires that we perform all the calculations using 64-bit arithmetic, so we have to worry about overflow.

Since a_hi, a_lo, b_hi, and b_lo are all unsigned 32-bit numbers, the product of any two of them fits in an unsigned 64-bit number without overflow. However, the intermediate results of the above calculation will not: for instance, (a_hi * b_lo) << 32 can exceed 2^64, and the sum of the partial products can carry out of 64 bits as well.

The following code implements mulhi(a, b) when the mathematics must be performed modulo 2^64 (i.e., using only 64-bit arithmetic):

uint64_t mulhi(uint64_t a, uint64_t b)
{
    // Split each operand into its 32-bit halves.
    uint64_t    a_lo = (uint32_t)a;
    uint64_t    a_hi = a >> 32;
    uint64_t    b_lo = (uint32_t)b;
    uint64_t    b_hi = b >> 32;

    // Each 32x32 => 64-bit partial product fits in a uint64_t.
    uint64_t    a_x_b_hi =  a_hi * b_hi;
    uint64_t    a_x_b_mid = a_hi * b_lo;
    uint64_t    b_x_a_mid = b_hi * a_lo;
    uint64_t    a_x_b_lo =  a_lo * b_lo;

    // Carry out of summing the low halves of the middle products with the
    // high half of the low product.
    uint64_t    carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
                             (uint64_t)(uint32_t)b_x_a_mid +
                             (a_x_b_lo >> 32) ) >> 32;

    uint64_t    multhi = a_x_b_hi +
                         (a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
                         carry_bit;

    return multhi;
}

As Yakk points out, if you don't mind being off by +1 in the upper 64 bits, you can omit the calculation of the carry bit.
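
For reference, here's a sketch of that cheaper variant (the name mulhi_approx is mine). It drops carry_bit, so the result can come out slightly below the true high half:

// Approximate high half: same decomposition as above, minus the carry
// from the low partial sums, so it may undercount the true high half.
uint64_t mulhi_approx(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b;
    uint64_t b_hi = b >> 32;

    return a_hi * b_hi + ((a_hi * b_lo) >> 32) + ((b_hi * a_lo) >> 32);
}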


TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.


Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)


GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms. (Or in older versions, __uint128_t). GCC doesn't implement this type on 32-bit platforms, though.

This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)

MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention IPF (IA-64) as having _umul128 available, but I don't have MSVC for Itanium available. (Probably not relevant anyway.)

#define HAVE_FAST_mul64 1

#ifdef __SIZEOF_INT128__     // GNU C
 static inline
 uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int128 prod =  a * (unsigned __int128)b;
     return prod >> 64;
 }

#elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
   // MSVC for x86-64 or AArch64
   // possibly also  || defined(_M_IA64) || defined(_WIN64)
   // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux

  // https://docs.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
  #define mulhi64 __umulh

#elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
  // https://docs.microsoft.com/en-gb/cpp/intrinsics/umul128
  // the docs incorrectly say that _umul128 is available for ARM,
  // which would be weird because there's no single insn for it on AArch32
  #include <intrin.h>
  static inline
  uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int64 HighProduct;
     (void)_umul128(a, b, &HighProduct);
     return HighProduct;
  }

#else

# undef HAVE_FAST_mul64
  uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
  // or you might want to define @craigster0's version here so it can inline.
#endif
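
For what it's worth, here's a quick sanity check you could drop into a test file (the harness is mine; mulhi_portable stands for the exact portable routine from the answer above):

#include <assert.h>
#include <stdint.h>

// Hypothetical self-test: compare mulhi64 (from the #ifdef block) against
// the exact portable implementation on a few values.
void check_mulhi64(void)
{
    assert(mulhi64(0, 0) == 0);
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the high half is 2^64 - 2.
    assert(mulhi64(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1);
    assert(mulhi64(0xDEADBEEFCAFEF00Dull, 0x0123456789ABCDEFull) ==
           mulhi_portable(0xDEADBEEFCAFEF00Dull, 0x0123456789ABCDEFull));
}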

For x86-64, AArch64, and PowerPC64 (and others), this compiles to one mul instruction, and a couple movs to deal with the calling convention (which should optimize away after this inlines). From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):

     # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
     # MSVC makes basically the same function, but with different regs for x64 __fastcall
    mov     rax, rsi
    mul     rdi              # RDX:RAX = RAX * RDI
    mov     rax, rdx
    ret

(or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi to put the high-half in RAX directly. gcc is dumb and still uses an extra mov.)

For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):

test_var:
    umulh   x0, x0, x1
    ret

With a compile-time-constant power-of-2 multiplier, we usually get the expected right-shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).
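
For instance (a sketch; the function name is mine), multiplying by 16 and taking the high half is just a >> 60:

// High half of a * 16: (a << 4) >> 64 == a >> 60, so this should compile
// to a single shift (or gcc's shld, as noted above).
uint64_t mulhi16(uint64_t a)
{
    return (uint64_t)((a * (unsigned __int128)16) >> 64);   // GNU C
}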


As noted above, current compilers don't optimize the portable version: for x86-64 you get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions. i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacks of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.

A full-multiply mul r64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, the same as imul r64,r64, which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.

Check it out on the Godbolt compiler explorer on the above link.

gcc does fully optimize the portable version when multiplying by 16, though: you get a single right shift, more efficient than with the unsigned __int128 multiply.