Accurate floating-point linear interpolation

I want to perform a simple linear interpolation between $A$ and $B$ (which are binary floating-point values) using floating-point math with IEEE-754 round-to-nearest-or-even rounding rules, as accurately as possible. Please note that speed is not a big concern here.

I know of two basic approaches. I'll use the symbols $\oplus, \ominus, \otimes, \oslash$ following Knuth [1], to mean floating-point addition, subtraction, product and division, respectively (actually I don't use division, but I've listed it for completeness).

(1) $\quad f(t) = A\,\oplus\,(B\ominus A)\otimes t$

(2) $\quad f(t) = A\otimes(1\ominus t)\,\oplus \,B\otimes t$
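
For concreteness, here is a direct C transcription of the two formulas (double precision, default round-to-nearest-even; the names lerp1 and lerp2 are just labels I'm using here):

/* Method (1): A + (B - A)*t, each operation rounded to nearest */
double lerp1(double A, double B, double t)
{
    return A + (B - A) * t;
}

/* Method (2): A*(1 - t) + B*t, each operation rounded to nearest */
double lerp2(double A, double B, double t)
{
    return A * (1 - t) + B * t;
}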

Each method has its pros and cons. Method (1) is clearly monotonic, which is a very valuable property, while it is not obvious at all to me that the same holds for method (2), and I suspect it may not. On the other hand, method (2) has the advantage that when $t = 1$ the result is exactly $B$, not an approximation, which is also a desirable property (and it gives exactly $A$ when $t = 0$, but method (1) does that too). That follows from the properties listed in [2], in particular:

$u\oplus v = v\oplus u$

$u\ominus v = u\oplus -v$

$u\oplus v = 0$ if and only if $v = -u$

$u\oplus 0 = u$

$u\otimes 1 = u$

$u\otimes v = 0$ if and only if $u = 0$ or $v = 0$

In [3] Knuth also discusses this case:

$u' = (u\oplus v)\ominus v$

which implicitly means that $u'$ may or may not be equal to $u$. Replacing $u$ with $B$ and $v$ with $-A$ and using the above rules, it follows that it is not guaranteed that $A\oplus(B\ominus A) = B$, meaning that method (1) does not always produce $B$ when $t = 1$.
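
A tiny illustration of that (my own example, not Knuth's): in double precision, with $u = 2^{-53}$ and $v = 1$, the sum $u\oplus v$ rounds back to exactly $1$, so $u' = (u\oplus v)\ominus v = 0 \ne u$:

#include <stdio.h>

int main(void)
{
    double u = 0x1p-53, v = 1.0;
    double up = (u + v) - v;              /* u' = (u (+) v) (-) v         */
    printf("u  = %a\nu' = %a\n", u, up);  /* u' comes out as 0, not 2^-53 */
    return 0;
}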

So, here come my questions:

  1. Is method (2) guaranteed to be monotonic?
  2. If not, is there a better method that is accurate, monotonic and yields $A$ when $t = 0$ and $B$ when $t = 1$?
  3. If not (or you don't know), when $t = 1$, does method (1) always overshoot (that is, $A\oplus(B\ominus A)=A+(B-A)\cdot t'$ for some $t' \geq 1$)? Always undershoot (ditto for some $t' \leq 1$)? Or sometimes overshoot and sometimes undershoot?

I assume that if method (1) always undershoots, I can make a special case when $t = 1$ to obtain the desired property of being exactly equal to $B$ when $t = 1$, but if it always overshoots, then I can't. That's the reason for question 3.

EDIT: I've found that the answer to question 3 is that it sometimes overshoots and sometimes undershoots. For example, in double precision:

-0x1.cae164da859c9p-1 + (0x1.eb4bf7b6b2d6ep-1 - (-0x1.cae164da859c9p-1))

results in 0x1.eb4bf7b6b2d6fp-1, which is 1 ulp greater than the original $B$, while

-0x1.be03888ad585cp-1 + (0x1.0d9940702d541p-1 - (-0x1.be03888ad585cp-1))

results in 0x1.0d9940702d540p-1, which is 1 ulp less than the original $B$. On the other hand, the method that I planned (special-casing $t=1$) won't fly, because I've found cases where $t < 1$ and yet $A\oplus(B\ominus A)\otimes t > B$, for example:

t = 0x1.fffffffffffffp-1
A = 0x1.afb669777cbfdp+2
B = 0x1.bd7b786d2fd28p+1

$A \oplus (B \ominus A)\otimes t =\,$ 0x1.bd7b786d2fd29p+1

which means that if method (1) is to be used, the only strategy that may work is clamping.
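
For reference, here is a small C program of mine that reproduces the three double-precision examples above, using the hexadecimal literals exactly as quoted:

#include <stdio.h>

int main(void)
{
    /* overshoot example: the sum is 1 ulp above B1 */
    double A1 = -0x1.cae164da859c9p-1, B1 = 0x1.eb4bf7b6b2d6ep-1;
    printf("%a\n", A1 + (B1 - A1));

    /* undershoot example: the sum is 1 ulp below B2 */
    double A2 = -0x1.be03888ad585cp-1, B2 = 0x1.0d9940702d541p-1;
    printf("%a\n", A2 + (B2 - A2));

    /* t < 1 and yet A3 + (B3 - A3)*t comes out greater than B3 */
    double t  = 0x1.fffffffffffffp-1;
    double A3 = 0x1.afb669777cbfdp+2, B3 = 0x1.bd7b786d2fd28p+1;
    printf("%a\n", A3 + (B3 - A3) * t);

    return 0;
}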

Update: As noted by Davis Herring in a comment and later checked by me, special-casing $t=1$ actually works.


References

[1] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, third edition, p. 215

[2] Op. cit., pp. 230-231

[3] Op. cit., p. 235, eq. (41)


Solution 1:

1) Method 2 is not always monotonic. Counterexample in double precision:

$$A = B = 4000\quad t=\tt{0x1.b2f3db7800a39p-2}$$

From those values we get:

$$1\ominus t=\tt{0x1.26861243ffae4p-1}$$

$$4000\otimes \tt{0x1.26861243ffae4p-1}\oplus 4000\otimes \tt{0x1.b2f3db7800a39p-2}=\tt{0x1.f400000000001p+11}\approx 4000.0000000000005$$

(obviously, when $t = 0$ and $t = 1$ the result equals 4000, so it ascends then descends, therefore it's not monotonic).
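
This counterexample can be reproduced with a few lines of C (double precision, default rounding), which should print the values quoted above:

#include <stdio.h>

int main(void)
{
    double t = 0x1.b2f3db7800a39p-2;
    double r = 4000.0 * (1.0 - t) + 4000.0 * t;  /* method (2) with A = B = 4000 */
    printf("1-t = %a\nr   = %a = %.17g\n", 1.0 - t, r, r);
    return 0;
}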

2) The second question asks for a better method. Here is a candidate:

$$lerp_{A,B}(t)=\begin{cases}A\oplus(B\ominus A)\otimes t&& \text{if}\,\ t<0.5 \\ B\ominus(B\ominus A)\otimes(1\ominus t)&&\text{otherwise}\end{cases}$$
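
In C (single precision, say), the candidate reads roughly as follows; this is just a sketch, and lerp_candidate is a name of my own choosing:

/* Switch formulas at t = 0.5 so that both endpoints are reproduced exactly. */
float lerp_candidate(float A, float B, float t)
{
    if (t < 0.5f)
        return A + (B - A) * t;
    else
        return B - (B - A) * (1.0f - t);
}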

This method has the following properties:

  • It matches the endpoints (obvious).
  • Both halves are monotonic (obvious).
  • Exact when $A=B$ (obvious).
  • Seems to be monotonic at the switchover between the two halves.

The latter is the only part that is not proven. It turns out that there are many values of $A$ and $B$ with $A<B$ for which $A\oplus(B\ominus A)\otimes 0.5>B\ominus(B\ominus A)\otimes 0.5$; however, after testing many billions of single- and double-precision numbers, I could not find a single case violating either of these implications:

Let $s=0.5-\mathrm{ulp}(0.5)/2\quad$(the number immediately preceding 0.5), then

$A<B\implies A\oplus(B\ominus A)\otimes s\le B\ominus(B\ominus A)\otimes 0.5$

$A>B\implies A\oplus(B\ominus A)\otimes s\ge B\ominus(B\ominus A)\otimes 0.5$

Update: The reason it works seems to have to do with this fact: barring underflow, for any positive binary floating-point number $u$, $u\otimes s < u/2$, and similarly for negative $u$ with the inequality reversed.

Or, put another way, under the same assumption, for base 2 and precision $p$, $u\otimes\frac{2^p-1}{2^p} < u$. It can be shown that the bit that determines rounding (the bit immediately after the last bit of the mantissa) is always zero when performing that product, therefore the product is always rounded towards zero. That multiplication has the effect of subtracting 1 ulp from the number (or 1/2 ulp if the input number has a mantissa of exactly 1.0).
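
A brute-force way to gain some confidence in that claim (a test harness of my own, not a proof) is to multiply many random positive doubles by $s$ and check that the product stays strictly below $u/2$:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    /* s = 0.5 - ulp(0.5)/2, i.e. the double immediately preceding 0.5 */
    double s = nextafter(0.5, 0.0);
    srand(12345);

    for (long i = 0; i < 10000000; i++) {
        /* random positive value, far away from the underflow range */
        double u = ((double)rand() + 1.0) / ((double)rand() + 1.0);
        if (!(u * s < u / 2))
            printf("claim violated for u = %a\n", u);
    }
    printf("done\n");
    return 0;
}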


I've deleted my two other proposals after discovering that both violated monotonicity.

Solution 2:

Method 2 is monotonic, if you are using round-to-even (which is fortunately the default).

Let's consider $A=B=1$ and half-precision numbers ('cause they're short), and $t=5/2^{12}$:

t = 0.000000000101

1-t = 0.111111111011

But that's too long - we only get 11 bits. What value we actually get depends on what rounding mode we're in.

in "towards 0" (truncate) and "towards $-\infty$" (floor) modes:

1-t = 0.11111111101

in the other modes, "towards $\infty$" (ceiling), "ties away from 0", and "ties to even":

1-t = 0.11111111110

Now let's add them back together.

truncate and floor: $t+(1-t)=0.11111111111(1)=0.11111111111<1$

ceiling and ties away from 0: $t+(1-t)=1.0000000000(1) = 1.0000000001>1$

ties to even: $t+(1-t)=1.0000000000(1)=1.0000000000=1$

Some more analysis tells us what's going on: the goal is to have the two rounding steps counteract each other. This never happens with floor/truncate/ceiling. Most of the time it happens with ties away from zero, but in the situation where there is a tie, both rounding steps bias the result upward. With round-to-even, the rounding steps are always opposite each other: values of $t$ that round down during the $1-t$ step ($3/2^{12}$, for instance) round up during the addition step, and vice versa.

Solution 3:

The exact interpolated value can be represented as the arithmetic sum of 5 floating-point values: one for the approximated result, one for the error of each of the three operations [1], and one for the product error made when multiplying the subtraction error by $t$. Here's a C version of the algorithm in single precision; for simplicity, it uses fused multiply-add to calculate the product errors:

#include <math.h>

/* TwoSum: returns fl(a + b) and stores the exact rounding error in *err,
   so that a + b == sum + *err holds exactly. */
float exact_sum(float a, float b, float *err)
{
    float sum = a + b;
    float z = sum - a;
    *err = a - (sum - z) + (b - z);
    return sum;
}

/* Exact product via FMA: returns fl(a * b) and stores the rounding error
   in *err, so that a * b == prod + *err holds exactly. */
float exact_mul(float a, float b, float *err)
{
    float prod = a * b;
    *err = fmaf(a, b, -prod);
    return prod;
}

/* Returns fl(A + (B - A) * t); the exact value is result + err1 + err2 + err3 + err4. */
float exact_lerp(float A, float B, float t,
                 float *err1, float *err2, float *err3, float *err4)
{
    float diff = exact_sum(B, -A, err1);    /* B - A and its rounding error       */
    float prod = exact_mul(diff, t, err2);  /* (B - A) * t and its rounding error */
    *err1 = exact_mul(*err1, t, err4);      /* propagate err1 through the product */
    return exact_sum(A, prod, err3);        /* A + (B - A) * t and its error      */
}

In order for this algorithm to work, operations need to conform to IEEE-754 semantics in round-to-nearest-or-even mode. That's not guaranteed by the C standard, but the GNU gcc compiler can be instructed to do so, at least on processors supporting SSE2 [2][3].
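
For instance (my own illustration; lerp.c is a placeholder name for a file containing the code above), on x86 one might compile with something like

gcc -std=c11 -O2 -msse2 -mfpmath=sse lerp.c -lm

so that float arithmetic is performed in SSE registers instead of the x87 unit; see [2][3] for the details.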

It is guaranteed that the arithmetic addition (as opposed to floating-point addition) of $result + err_1 + err_2 + err_3 + err_4$ is equal to the exact interpolated value; however, there is no guarantee that this exact sum is itself representable as a floating-point number. I'm not aware of any guarantees for partial sums of these terms. $err_4$ is expected to be generally much smaller than the others when they are nonzero.

For example: exact_lerp(0.23456789553165435791015625f, 7.345678806304931640625f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns $2.3679010868072509765625$ and the errors are $6.7055225372314453125\cdot 10^{-8}$, $8.4771045294473879039287567138671875\cdot 10^{-8}$, $1.490116119384765625\cdot 10^{-8}$ and $2.66453525910037569701671600341796875\cdot 10^{-15}$ respectively. These numbers add up to the exact result, which is $2.36790125353468550173374751466326415538787841796875$ (not a single-precision float, though in this case it happens to fit in a double-precision one).

All numbers in the example above are written using their exact values, rather than a number that approximates to them. All but the last are single-precision floats.

If the goal is precision, I would expect that calculating $err_1 \oplus err_2 \oplus err_3 \oplus err_4 \oplus result$ (in left-to-right order) gives a better approximation to the actual result than just using $result$ alone. In the example above, it gives $2.367901325225830078125$, which matches the rounded value of the exact result. I haven't done any testing regarding monotonicity. $t=0$ matches the endpoint, and probably $t=1$ does too.
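
As a sketch, that compensated evaluation could be written like this (accurate_lerp is just a name I'm making up here; it reuses exact_lerp from above):

float accurate_lerp(float A, float B, float t)
{
    float err1, err2, err3, err4;
    float result = exact_lerp(A, B, t, &err1, &err2, &err3, &err4);
    /* Add the error terms back in, in the left-to-right order suggested above. */
    return err1 + err2 + err3 + err4 + result;
}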

Not sure if it's worth adding $err_4$ at all, given how small its contribution is.

References

  • [1] Graillat, Stef (2007). Accurate Floating Point Product and Exponentiation.
  • [2] https://stackoverflow.com/questions/7295861/enabling-strict-floating-point-mode-in-gcc
  • [3] Semantics of Floating Point Math in GCC