problems in floating point comparison [duplicate]

#include <stdio.h>
#include <conio.h>   /* non-standard header, for getch() */

int main(void)
{
    float f = 0.98;       /* 0.98 is a double literal, rounded to float here */
    if(f <= 0.98)
        printf("hi");
    else
        printf("hello");
    getch();
    return 0;
}

I am running into a problem here: with different floating point values of f I get different results. Why is this happening?


Solution 1:

f has float precision, but the literal 0.98 is a double by default, so the expression f <= 0.98 is evaluated in double precision.

f is therefore converted to double for the comparison, and because the float rounding of 0.98 is slightly larger than the double rounding, the converted value ends up greater than 0.98.

Use

if(f <= 0.98f)

or use a double for f instead.
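A minimal sketch showing both comparisons side by side (assuming IEEE 754 float and double, as in the analysis below):

#include <stdio.h>

int main(void)
{
    float f = 0.98f;

    /* Against the double literal, f is promoted to double; the promoted
       value is slightly larger than the double 0.98, so this is false. */
    printf("f <= 0.98  : %s\n", f <= 0.98  ? "true" : "false");

    /* Against the float literal, both sides hold the same single-precision
       value, so this is true. */
    printf("f <= 0.98f : %s\n", f <= 0.98f ? "true" : "false");

    return 0;
}

On a typical IEEE 754 implementation this prints false for the first comparison and true for the second, which is exactly the behaviour seen in the question.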


In detail... assuming float is IEEE single-precision and double is IEEE double-precision.

These kinds of floating point numbers are stored in a base-2 representation. In base 2, 0.98 is a repeating fraction and would need infinite precision to represent exactly:

0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...

A float can only store 24 significant bits, i.e.

       0.111110101110000101000111_101...
                                 ^ round off here
   =   0.111110101110000101001000

   =   16441672 / 2^24

   =   0.98000001907...

A double can store 53 significant bits, so

       0.11111010111000010100011110101110000101000111101011100_00101000...
                                                              ^ round off here
   =   0.11111010111000010100011110101110000101000111101011100

   =   8827055269646172 / 2^53

   =   0.97999999999999998224...

So 0.98 becomes slightly larger when stored as a float and slightly smaller when stored as a double.
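A short sketch to verify these hand-computed values (assuming IEEE 754 float and double; printing with extra digits exposes the rounding error in each type):

#include <stdio.h>

int main(void)
{
    /* 0.98f is the single-precision rounding, 0.98 the double-precision one. */
    printf("0.98 as float : %.17f\n", (double)0.98f); /* typically 0.98000001907348633 */
    printf("0.98 as double: %.17f\n", 0.98);          /* typically 0.97999999999999998 */
    return 0;
}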

Solution 2:

It's because floating point values are not exact representations of decimal numbers. Base-ten values have to be represented on the computer in base 2, and most decimal fractions cannot be represented exactly; it's in this conversion that precision is lost.

Read more about this at http://en.wikipedia.org/wiki/Floating_point
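A quick way to see that loss in code, before working through the bits by hand as in the example below (a minimal C sketch):

#include <stdio.h>

int main(void)
{
    /* Printing with more digits than the literal shows what is really stored. */
    printf("1.1 as float : %.17f\n", (double)1.1f); /* typically 1.10000002384185791 */
    printf("1.1 as double: %.17f\n", 1.1);          /* typically 1.10000000000000009 */
    return 0;
}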


An example (from encountering this problem in my VB6 days)

To store the number 1.1 as a single-precision floating point number, we need to convert it to binary. There are 32 bits to fill.

Bit 1 is the sign bit (1 for negative, 0 for positive).
Bits 2-9 are the exponent.
Bits 10-32 are the mantissa (a.k.a. significand, basically the coefficient of the scientific notation).

So for 1.1 the single-precision floating point value is stored as follows (this is the truncated value; the compiler may round the least significant bit behind the scenes, but here I simply truncate, which is slightly less accurate but doesn't change the point of this example):

s --exp--- -------mantissa--------
0 01111111 00011001100110011001100

Notice that the mantissa contains the repeating pattern 0011: 1/10 in binary is like 1/3 in decimal, it goes on forever. To retrieve the value from the 32-bit single-precision float, we must first convert the exponent and mantissa back to decimal numbers so we can use them.

sign = 0 = a positive number

exponent: 01111111 = 127

mantissa: 00011001100110011001100 = 838860

The mantissa needs to be converted to a decimal fraction, because there is an implied leading 1 ahead of the stored bits (i.e. 1.00011001100110011001100). The 1 is implied because the mantissa represents a normalized value used in the scientific notation: 1.0001100110011... * 2^(x-127).

To get the decimal value out of 838860 we simply divide by 2^23, as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 gives us 1.099999904632568359375. The exponent is 127, but the formula calls for 2^(x-127).

So here is the math:

(1 + 0.099999904632568359375) * 2^(127-127)

1.099999904632568359375 * 1 = 1.099999904632568359375

As you can see, 1.1 is not really stored in the single-precision floating point value as exactly 1.1.
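For completeness, here is a small C sketch that performs the same decoding on the bits the machine actually stores (assuming IEEE 754 single precision, a 4-byte float, and a normal nonzero value; note that the compiler rounds the last mantissa bit, so the decoded value comes out slightly above 1.1 rather than the truncated 1.0999999... above):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = 1.1f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's bit pattern */

    uint32_t sign     = bits >> 31;          /* bit 1: sign                        */
    uint32_t exponent = (bits >> 23) & 0xFF; /* bits 2-9: exponent, biased by 127  */
    uint32_t mantissa = bits & 0x7FFFFF;     /* bits 10-32: mantissa               */

    /* Reconstruct: (1 + mantissa / 2^23) * 2^(exponent - 127).
       Assumes a normal number (exponent not 0 or 255). */
    double value = 1.0 + mantissa / 8388608.0;
    int e = (int)exponent - 127;
    for (; e > 0; e--) value *= 2.0;
    for (; e < 0; e++) value /= 2.0;
    if (sign) value = -value;

    printf("sign     = %u\n", sign);
    printf("exponent = %u (2^%d)\n", exponent, (int)exponent - 127);
    printf("mantissa = %u\n", mantissa);
    printf("value    = %.25f\n", value);
    return 0;
}

On a typical IEEE system this reports sign 0, exponent 127, mantissa 838861 (one more than the truncated 838860 above, because of rounding) and a value of about 1.10000002384185791015625.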