exact representation of floating points in c

basically with float you get 32 bits that encode

VALUE   = SIGN * MANTISSA * 2 ^ (128 - EXPONENT)
32-bits = 1-bit  23-bits               8-bits

and that is stored as

MSB                    LSB
[SIGN][EXPONENT][MANTISSA]

since you only get 23 bits, that's the amount of "precision" you can store. If you are trying to represent a fraction that is irrational (or repeating) in base 2, the sequence of bits will be "rounded off" at the 23rd bit.

0.7 base 10 is 7 / 10 which in binary is 0b111 / 0b1010 you get:

0.1011001100110011001100110011001100110011001100110011... etc

Since this repeats, in fixed precision there is no way to exactly represent it. The same goes for 0.8 which in binary is:

0.1100110011001100110011001100110011001100110011001101... etc

To see what the fixed precision value of these numbers is you have to "cut them off" at the number of bits you and do the math. The only trick is you the leading 1 is implied and not stored so you technically get an extra bit of precision. Because of rounding, the last bit will be a 1 or a 0 depending on the value of the truncated bit.

So the value of 0.7 is effectively 11,744,051 / 2^24 (no rounding effect) = 0.699999988 and the value of 0.8 is effectively 13,421,773 / 2^24 (rounded up) = 0.800000012.

That's all there is to it :)


A good reference for this is What Every Computer Scientist Should Know About Floating-Point Arithmetic. You can use higher precision types (e.g. double) or a Binary Coded Decimal (BCD) library to achieve better floating point precision if you need it.


The internal representation is IEE754.

You can also use this calculator to convert decimal to float, I hope this helps to understand the format.