What types of numbers are representable in binary floating-point?
I've read a lot about floats, but it's all unnecessarily involved. I think I've got it pretty much understood, but there's just one thing I'd like to know for sure:
I know that fractions of the form `1/pow(2,n)`, with `n` an integer, can be represented exactly in floating-point numbers. This means that if I add `1/32` to itself 32 million times, I would get exactly 1,000,000.
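Here's a quick sanity check in Python that seems to bear this out (assuming the usual IEEE 754 doubles):

```python
# Summing the exactly-representable 1/32: every partial sum k/32
# fits in a double's 53-bit significand, so no error accumulates.
total = 0.0
for _ in range(32_000_000):   # takes a few seconds in pure Python
    total += 1 / 32
print(total == 1_000_000.0)   # True

# Contrast with 0.1, which is not exactly representable in binary.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)           # False: rounding error accumulates
```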
What about something like `1/(32+16)`? It's one over the sum of two powers of two; does this work? Or is it `1/32 + 1/16` that works? This is where I'm confused, so if anyone could clarify that for me, I would appreciate it.
Solution 1:
The rule can be summed up as this:
- A number can be represented exactly in binary if and only if the prime factorization of its (reduced) denominator contains only 2s (i.e., the denominator is a power of two).
So `1/(32 + 16)` is not representable in binary because it has a factor of 3 in the denominator. But `1/32 + 1/16 = 3/32` is.
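One way to check this rule mechanically, as a small Python sketch built on the standard `fractions` module (the helper name is just for illustration):

```python
from fractions import Fraction

def representable_in_binary(frac: Fraction) -> bool:
    # Fraction auto-reduces, so it suffices to test whether the
    # reduced denominator is a power of two.
    d = frac.denominator
    return d & (d - 1) == 0

print(representable_in_binary(Fraction(1, 32 + 16)))               # False: 1/48 has a factor of 3
print(representable_in_binary(Fraction(1, 32) + Fraction(1, 16)))  # True: 3/32
```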
That said, there are further restrictions on what a floating-point type can hold. For example, you only have 53 bits of mantissa in an IEEE `double`, so `1/2 + 1/2^500` is not representable even though both terms are powers of two.

So you can sum powers of two as long as the range of the exponents doesn't span more than 53 powers.
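A small illustration of that limit (assuming IEEE 754 doubles, as in CPython):

```python
# 1/2 and 2**-500 are each exactly representable on their own, but
# their sum would need a 500-bit significand, so the tiny term is
# rounded away entirely.
print(0.5 + 2.0**-500 == 0.5)   # True: 2**-500 was lost

# Keep the exponents within 53 powers of two and the sum is exact.
print(0.5 + 2.0**-52 == 0.5)    # False: this sum still fits in 53 bits
```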
To generalize this to other bases:

- A number can be exactly represented in base 10 if the prime factorization of the denominator consists of only 2s and 5s.
- A rational number `X` can be exactly represented in base `N` if the prime factorization of the denominator of `X` contains only primes found in the factorization of `N`.
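As a sketch of that rule in code (the function `terminates_in_base` is just an illustrative name, not a library call):

```python
from math import gcd

def terminates_in_base(num: int, den: int, base: int) -> bool:
    # num/den has a finite expansion in `base` iff, after reduction,
    # the denominator divides some power of the base, i.e. stripping
    # every prime factor shared with the base leaves 1.
    d = den // gcd(num, den)
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    return d == 1

print(terminates_in_base(1, 3, 10))   # False: 3 is not a factor of 10
print(terminates_in_base(1, 40, 10))  # True: 40 = 2^3 * 5
print(terminates_in_base(1, 3, 6))    # True: 6 = 2 * 3 contains the prime 3
```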
Solution 2:
A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M · 2^e for some integers M and e such that -2^53 < M < 2^53 and -1074 ≤ e ≤ 971.
For single precision, -2^24 < M < 2^24 and -149 ≤ e ≤ 104.
For double precision, these are consequences of the facts that the double-precision format uses 52 bits to store a significand (which normally has 53 bits due to an implicit leading 1) and 11 bits to store an exponent. The 11 bits encode numbers from 0 to 2047, but 0 and 2047 are excluded for special purposes, and the encoded number is biased by 1023, so it represents unbiased exponents from -1022 to 1023. However, those unbiased exponents are for significands in the interval [1, 2), and such significands have fractions. To express the significand as an integer, I adjusted the exponent range by 52. Single precision is similar, with 23 bits to store a 24-bit significand, 8 bits for the exponent, and a bias of 127.
Expressing the representable numbers using an integer times a power of two rather than the more common fractional significand simplifies some number theory and other reasoning about floating-point properties. I used it in this answer because it allows the set of representable values to be expressed concisely.
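For instance, CPython's `float.as_integer_ratio` returns a double's exact value as a fraction whose denominator is a power of two, from which M and e fall out directly (the helper name below is just for illustration):

```python
def as_m_times_2e(x: float):
    # Exact value of x as numerator/denominator; the denominator of a
    # finite float is always a power of two.
    n, d = x.as_integer_ratio()
    e = -(d.bit_length() - 1)   # d == 2**(-e)
    return n, e

m, e = as_m_times_2e(0.1)
print(m, e)          # 3602879701896397 -55
print(m * 2.0**e)    # 0.1 -- reproduces the stored value exactly
```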
Solution 3:
Normalized floating-point numbers are represented using the form:

1.m * 2^e

where `1.m` is a binary fraction and `e` is an integer (positive, negative, or zero).
As such, you can represent `1/32 + 1/16` exactly, as:

1.1000000 * 2^-4

(`1.10` being the binary fraction equivalent to 1.5.) `1/48`, however, is not representable in this format, because its binary expansion repeats forever.
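A quick way to see this in CPython: `float.hex` prints exactly this significand-times-power-of-two form.

```python
from fractions import Fraction

print((1/32 + 1/16).hex())   # 0x1.8000000000000p-4 -> 1.5 * 2**-4, exact
print((1/48).hex())          # 0x1.5555555555555p-6 -> repeating pattern, rounded

print(Fraction(1/32 + 1/16) == Fraction(3, 32))   # True: stored exactly
print(Fraction(1/48) == Fraction(1, 48))          # False: only an approximation
```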