Is my fma() broken?

It is Cygwin's fault. Or more correctly, the newlib C library it uses. It explicitly says it does not even try to get fma() emulation correct.

GNU C library has correct emulation for almost all fma variants since 2015. For details, and the patches used to implement this, see the sourceware bug 13304.

If efficiency is not a problem, then I would simply use e.g.

#if defined(__CYGWIN__) && !defined(__FMA__) && !defined(__FMA3__) && !defined(__FMA4__)
#define fma(x, y, z)  fma_emulation(x, y, z)

double fma_emulation(double x, double y, double z)
{
    /* One of the implementations linked above */
}
#endif

I do not personally use Windows at all, but if anyone does (use Windows and need the fma emulation), I'd suggest they try and push a patch upstream, with a link to the GNU C library discussion on correct fma emulation.


What I am wondering, is whether it would be possible to examine just the low M bits of the result (discarded in the rounding) to determine the correct value of the ULP in the result, and adjust the result obtained using the straightforward a×b+c operation accordingly, using nextafter(); rather than using multiprecision arithmetic to implement the entire operation.

Edit: No, because the addition may overflow, dropping an extra bit as the MSB of the discarded part. For that reason alone, we do need to do the entire operation. Another reason is that if a×b and c have different signs, then instead of addition, we substract smaller in magnitude from larger in magnitude (result using the larger's sign), which may lead to clearing several high bits in the larger, and that affects which bits of the entire result are dropped in the rounding.

However, for the IEEE-754 Binary64 double on x86 and x86-64 architectures, I do believe fma emulation using 64-bit (integer) registers and 128-bit product, is still quite feasible. I shall experiment on a representation where low 2 bits in a 64-bit register are used for the rounding decision bits (LSB being a logical OR of all the dropped bits), 53 bits used for the mantissa, and one carry bit, leaving 8 unused and ignored high bits. The rounding is performed when the unsigned integer mantissa is converted to a (64-bit) double. If these experiments yield anything useful, I shall describe them here.


Initial findings: fma() emulation on a 32-bit system is slow. The 80-bit stuff on the 387 FPU is basically useless here, and implementing the 53×53-bit multiplication (and bit shifting) on a 32-bit system is just .. not worth the effort. The glibc fma() emulation code linked to above is good enough in my opinion.

Additional findings: Handling non-finite values is nasty. (Subnormals are only slightly annoying, requiring special handling (as the implicit MSB in mantissa is zero then).) If any of the three arguments is non-finite (infinity or some form of NaN), then returning a*b + c (not fused) is the only sane option. Handling these cases require additional branching, which slows down the emulation.

Final decision: The number of cases to handle in an optimized manner (rather than using the multiprecision "limb" approach as used in the glibc emulation) is large enough to make this approach not worth the effort. If each limb is 64-bit, each of a, b, and c is spread over at most 2 limbs, and a×b over three limbs. (With 32-bit limbs, that is just 3 and 5 limbs, respectively.) Depending on whether a×b and c have the same or different signs, there are only two fundamentally different cases to handle -- in the different signs case, the addition turns into subtraction (smaller from larger, result getting the same sign as the larger value).

In short, the multiprecision approach is better. The actual precision needed is very well bounded, and does not need even dynamic allocation. If the product of the mantissas of a and b can be calculated efficiently, the multiprecision part is limited to holding the product and handling the addition/subtraction. Final rounding can be done by converting the result to a 53-bit mantissa, exponent, and two extra low bits (the higher being the most significant bit lost in the rounding, and the lower being an OR of the rest of the bits lost in the rounding). Essentially, the key operations can be done using integers (or SSE/AVX registers), and the final conversion from a 55-bit mantissa to double handles the rounding according to current rules.