What types of numbers are representable in binary floating-point?
I've read a lot about floats, but it's all unnecessarily involved. I think I've got it pretty much understood, but there's just one thing I'd like to know for sure:
I know that fractions of the form `1/pow(2,n)`, with `n` an integer, can be represented exactly in floating-point numbers. This means that if I add `1/32` to itself 32 million times, I would get exactly 1,000,000.
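Here's a quick sanity check in Python that seems to bear this out (assuming the usual IEEE 754 doubles):

```python
# Summing the exactly-representable 1/32: every partial sum k/32
# fits in a double's 53-bit significand, so no error accumulates.
total = 0.0
for _ in range(32_000_000):   # takes a few seconds in pure Python
    total += 1 / 32
print(total == 1_000_000.0)   # True

# Contrast with 0.1, which is not exactly representable in binary.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)           # False: rounding error accumulates
```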
What about something like `1/(32+16)`? It's one over the sum of two powers of two; does this work? Or is it `1/32 + 1/16` that works? This is where I'm confused, so if anyone could clarify that for me, I would appreciate it.
Solution 1:
The rule can be summed up as this:
- A number can be represented exactly in binary if and only if the prime factorization of its (reduced) denominator contains only 2s (i.e., the denominator is a power of two).
So `1/(32 + 16)` is not representable in binary because it has a factor of 3 in the denominator. But `1/32 + 1/16 = 3/32` is.
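One way to check this rule mechanically, as a small Python sketch built on the standard `fractions` module (the helper name is just for illustration):

```python
from fractions import Fraction

def representable_in_binary(frac: Fraction) -> bool:
    # Fraction auto-reduces, so it suffices to test whether the
    # reduced denominator is a power of two.
    d = frac.denominator
    return d & (d - 1) == 0

print(representable_in_binary(Fraction(1, 32 + 16)))               # False: 1/48 has a factor of 3
print(representable_in_binary(Fraction(1, 32) + Fraction(1, 16)))  # True: 3/32
```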
That said, there are further restrictions on what a floating-point type can hold. For example, you only have 53 bits of mantissa in an IEEE `double`, so `1/2 + 1/2^500` is not representable even though both terms are powers of two.

So you can sum powers of two as long as the range of the exponents doesn't span more than 53 powers.
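A small illustration of that limit (assuming IEEE 754 doubles, as in CPython):

```python
# 1/2 and 2**-500 are each exactly representable on their own, but
# their sum would need a 500-bit significand, so the tiny term is
# rounded away entirely.
print(0.5 + 2.0**-500 == 0.5)   # True: 2**-500 was lost

# Keep the exponents within 53 powers of two and the sum is exact.
print(0.5 + 2.0**-52 == 0.5)    # False: this sum still fits in 53 bits
```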
To generalize this to other bases:

- A number can be exactly represented in base 10 if the prime factorization of the denominator consists of only 2s and 5s.
- A rational number `X` can be exactly represented in base `N` if the prime factorization of the denominator of `X` contains only primes found in the factorization of `N`.
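As a sketch of that rule in code (the function `terminates_in_base` is just an illustrative name, not a library call):

```python
from math import gcd

def terminates_in_base(num: int, den: int, base: int) -> bool:
    # num/den has a finite expansion in `base` iff, after reduction,
    # the denominator divides some power of the base, i.e. stripping
    # every prime factor shared with the base leaves 1.
    d = den // gcd(num, den)
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    return d == 1

print(terminates_in_base(1, 3, 10))   # False: 3 is not a factor of 10
print(terminates_in_base(1, 40, 10))  # True: 40 = 2^3 * 5
print(terminates_in_base(1, 3, 6))    # True: 6 = 2 * 3 contains the prime 3
```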
Solution 2:
A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M · 2^e for some integers M and e such that -2^53 < M < 2^53 and -1074 ≤ e ≤ 971.
For single precision, -2^24 < M < 2^24 and -149 ≤ e ≤ 104.
For double precision, these are consequences of the facts that the double-precision format uses 52 bits to store a significand (which normally has 53 bits due to an implicit leading 1) and 11 bits to store an exponent. The 11 bits encode numbers from 0 to 2047, but 0 and 2047 are excluded for special purposes, and the encoded number is biased by 1023, so it represents unbiased exponents from -1022 to 1023. However, those unbiased exponents are for significands in the interval [1, 2), and such significands have fractions. To express the significand as an integer, I adjusted the exponent range by 52. Single precision is similar, with 23 bits to store a 24-bit significand, 8 bits for the exponent, and a bias of 127.
Expressing the representable numbers using an integer times a power of two rather than the more common fractional significand simplifies some number theory and other reasoning about floating-point properties. I used it in this answer because it allows the set of representable values to be expressed concisely.
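For instance, CPython's `float.as_integer_ratio` returns a double's exact value as a fraction whose denominator is a power of two, from which M and e fall out directly (the helper name below is just for illustration):

```python
def as_m_times_2e(x: float):
    # Exact value of x as numerator/denominator; the denominator of a
    # finite float is always a power of two.
    n, d = x.as_integer_ratio()
    e = -(d.bit_length() - 1)   # d == 2**(-e)
    return n, e

m, e = as_m_times_2e(0.1)
print(m, e)          # 3602879701896397 -55
print(m * 2.0**e)    # 0.1 -- reproduces the stored value exactly
```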
Solution 3:
Normalized floating-point numbers are represented using the form:

1.m * 2^e

where `1.m` is a binary fraction and `e` is an integer (positive, negative, or zero).
As such, you can represent `1/32 + 1/16` exactly, as:

1.1000000 * 2^-4

(`1.10` being the binary fraction equivalent to 1.5.) `1/48`, however, is not representable in this format, because its binary expansion repeats forever.
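A quick way to see this in CPython: `float.hex` prints exactly this significand-times-power-of-two form.

```python
from fractions import Fraction

print((1/32 + 1/16).hex())   # 0x1.8000000000000p-4 -> 1.5 * 2**-4, exact
print((1/48).hex())          # 0x1.5555555555555p-6 -> repeating pattern, rounded

print(Fraction(1/32 + 1/16) == Fraction(3, 32))   # True: stored exactly
print(Fraction(1/48) == Fraction(1, 48))          # False: only an approximation
```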