Does the C++ standard specify anything on the representation of floating point numbers?
For types T for which std::is_floating_point<T>::value is true, does the C++ standard specify anything about the way T must be implemented? For example, does T even have to follow a sign/mantissa/exponent representation, or can it be completely arbitrary?
From N3337:
[basic.fundamental/8]:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
If you want to check whether your implementation uses IEEE-754, you can use std::numeric_limits<T>::is_iec559:
static_assert(std::numeric_limits<double>::is_iec559,
              "This code requires IEEE-754 doubles");
There are a number of other helper traits in this area, such as has_infinity, quiet_NaN, and more.
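For instance, a minimal sketch that queries a few of these traits (all of the printed values are implementation-defined; on a typical IEEE-754 platform the first three are true):

#include <limits>
#include <iostream>

int main() {
    using lim = std::numeric_limits<double>;
    // Whether double is IEEE-754, and whether it can represent infinity / quiet NaN.
    std::cout << "is_iec559:     " << lim::is_iec559     << '\n'
              << "has_infinity:  " << lim::has_infinity  << '\n'
              << "has_quiet_NaN: " << lim::has_quiet_NaN << '\n'
              << "max():         " << lim::max()         << '\n';
}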
The C standard has an "annex" (in C11 it's Annex F) which lays out what it means for an implementation of C to be compliant with IEC 60559, the successor standard to IEEE 754. An implementation that conforms to Annex F must have IEEE-representation floating point numbers. However, implementing this annex is optional; the core standard specifically avoids saying anything about the representation of floating point numbers.
I do not know whether there is an equivalent annex for C++. It doesn't appear in N3337, but that might just mean it's distributed separately. The existence of std::numeric_limits<floating-type>::is_iec559 indicates that the C++ committee at least thought about this, but perhaps not in as much detail as the C committee did. (It is and has always been a damned shame that the C++ standard is not expressed as a set of edits to the C standard.)
No particular implementation is required. The C++ standard doesn't talk about it much at all. The C standard goes into quite a bit of detail about the conceptual model assumed for floating point numbers, with a sign, an exponent, and a significand in some base b, and so on. However, it specifically states that this is purely descriptive, not a requirement on the implementation (C11, footnote 21):
The floating-point model is intended to clarify the description of each floating-point characteristic and does not require the floating-point arithmetic of the implementation to be identical.
That said, although the details can vary, at least offhand it seems to me that producing (for example) a conforming implementation of double that didn't fit fairly closely with the usual model (i.e., a significand and an exponent) would be difficult, or at least difficult to do with competitive performance. It wouldn't be particularly difficult to have it vary in other ways though, such as rearranging the order of the fields, or using a different base.
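For what it's worth, here is a small sketch of that significand/exponent view using only standard facilities (std::frexp always decomposes relative to base 2, while numeric_limits reports the implementation's actual radix b):

#include <cmath>
#include <limits>
#include <iostream>

int main() {
    double x = 6.5;
    int exp = 0;
    // frexp writes x as significand * 2^exp with |significand| in [0.5, 1).
    double sig = std::frexp(x, &exp);
    std::cout << x << " = " << sig << " * 2^" << exp << '\n';            // prints 6.5 = 0.8125 * 2^3
    std::cout << "radix b: " << std::numeric_limits<double>::radix << '\n';
}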
The definitions of std::numeric_limits<T>::digits and std::numeric_limits<T>::digits10 imply fairly directly that whatever is listed as a floating point type must retain (at least approximately) the same precision for all numbers across a fairly wide range of magnitudes. By far the most obvious way to accomplish that is to have some number of bits/digits devoted to a significand, and some other (separate) set of bits devoted to an exponent.
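A short sketch of those traits; the values in the comments assume an IEEE-754 implementation, which the standard does not require:

#include <limits>
#include <iostream>

int main() {
    // digits counts base-radix digits in the significand; digits10 is the number of
    // decimal digits guaranteed to survive a round trip through the type.
    std::cout << std::numeric_limits<float>::digits    << '\n'  // typically 24
              << std::numeric_limits<float>::digits10  << '\n'  // typically 6
              << std::numeric_limits<double>::digits   << '\n'  // typically 53
              << std::numeric_limits<double>::digits10 << '\n'; // typically 15
}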
The idea of std::is_floating_point is to make user code of different origins work together better. Technically you can mark an int as std::is_floating_point without causing undefined behavior. But say you have some templated library that has to repeatedly divide by a value T n. To speed things up, the library creates T ni = 1 / n and replaces division by n with multiplication by ni. This works great for floating point numbers, but fails for integers. Therefore the library correctly performs the optimization only if std::is_floating_point<T>::value == true. If you lie, the code probably still works from the standard's point of view, but it is incorrect from a logical point of view. So if you write a class that behaves like a bigger float, mark it as std::is_floating_point; otherwise don't. This should get you both optimal and correct code.
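A sketch of the kind of library code described above, assuming C++17 for if constexpr; divide_all is a hypothetical name, not a standard function:

#include <type_traits>
#include <vector>

// Hypothetical library helper: divide every element of v by n.
template <typename T>
void divide_all(std::vector<T>& v, T n) {
    if constexpr (std::is_floating_point<T>::value) {
        // Floating point: precompute the reciprocal once and multiply.
        T ni = T(1) / n;
        for (T& x : v) x *= ni;
    } else {
        // Integral (or other) types: 1 / n would truncate, so divide directly.
        for (T& x : v) x /= n;
    }
}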