Computing with Floating Point Numbers

As mentioned earlier, computers cannot represent real numbers precisely since there are only a finite number of bits for storing a real number. Therefore, any number that has infinite number of digits such as 1/3, the square root of 2 and PI cannot be represented completely. Moreover, even a number of finite number of digits cannot be represented precisely. This is because of the way of encoding real numbers.

If you know the way of converting a real number (in base 10) to a real number in other bases, say base 2 or 16, you should be able to verify that 0.1 in base 10 is converted to 0.199999999..... in base 16 and 0.00011000110001100011..... in base 2. A simple real number is converted to a real number of infinite number of digits in base 2 and base 16. Since base 2 and base 16 are the two most frequently ways of encoding floating numbers, 0.1 in base 10 cannot be represented and stored exactly by those computers using base 2 and base 16 for floating point number computation. This is a big problem we have to live with (we cannot change this fact, except that we build base 10 computers again).

Loss of Significant Digits

As even more important problem is losing significant digits in computations, especially in subtractions. What is loss-of-significant-digit? Suppose we are using a base 10 computer and suppose this computer can store four significant digits. Adding 0.1234 and 0.9999 together yields 1.1233. Since this computer can only store four significant digits, the result will be either rounded or truncated (in this case truncated) before it can be stored into memory. Thus, in memory the result is 1.123, which is fine. Multiplications and division are the same as addition.

Consider subtraction. Subtracting 0.1234 from 0.1235 yields 0.0001. So, what is the problem? 0.1234 and 0.1235 are two numbers of four significant digits. After subtraction, the result 0.0001 has only 1 significant digit! That is, you have only one digit that you can trust. In a subsequent computation, say adding 0.1236 to this 0.0001, since the latter has only one digit you can trust, the result has one reliable digit only! Therefore, subtractions can cause the number of significant digits to decease and as a result should be avoided if it is possible.

Example 1

Let us take a look a familiar example: solving the following quadratic equation:

A commonly seen formula tells us the roots are

If you try a=1, b=-20000 and c=1, the roots computed with single precision are 20000 and 0. This is impossible since the product of the is equal to c/a, which is not zero. Where is the problem? If b is very large and both a and c are very small as in the above case, the value of b*b-4ac is almost identical to the square of b and as a result, the sum of the square root of this value and -b is zero. While we cannot remove the subtraction in the square root, we can do something for the one between b and the square root. The following offers two such methods:

Click here to download a C program of this example.

Example 2

Let us computing the value of sine using a well-known infinite series

Once you see the above series, you perhaps has a feeling that something weird is about to happen, because there are too many subtractions! Yes, it is the case. If we program the sine function as follows:

float  MySin(float x)
{
     float  sinx  = 0.0;
     float  term  = x;
     float  sign  = 1.0;
     float  denom = 1.0;

     while (fabs(term) >= EPSILON) {
          sinx = sinx + sign*term;
          term = term*x/(++denom);
          term = term*x/(++denom);
          sign = -sign;
     }
     return  sinx;
}
If the tolerance value for EPSILON is 1.0E-8, we have the following result for x being -30, -25, -20, ..., 20. 25 and 30:
 x             MySin          sin(x)
---            -----          ------
-30.000000     33002.515625   0.988032
-25.000000     270.567413     0.132352
-20.000000     -0.957176      -0.912945
-15.000000     -0.677538      -0.650288
-10.000000     0.544120       0.544021
-5.000000      0.958925       0.958924
0.000000       0.000000       0.000000
5.000000       -0.958925      -0.958924
10.000000      -0.544120      -0.544021
15.000000      0.677538       0.650288
20.000000      0.957176       0.912945
25.000000      -270.567413    -0.132352
30.000000      -33002.515625  -0.988032
In the above, the second column is the output of the above program while the third column is computed with the library function sin(). When the value of x is near to zero, the results are almost the same. As the value of x is getting larger, the effect of losing significant digits becomes so severe that the value of the function is not in the range of 0 and 1.

Note that the library function uses some other way to compute the values of sine.

Click here to download a C program of this example. You may want to try adding all positive and negative terms to two variables and using one subtraction to get the result. Compare your result with the bad one above. What do you get?

Mathematical Laws Sometimes Do Not Work for Finite Precision Computation

This is to warn you that those mathematical laws you are familiar with sometime do not work very well for finite precision computation. Carefully choosing the order of evaluating your expressions are important. We shall talk about commutative law and associative law only.

The commutative law states that the results of a*b and b*a are the same (in mathematics) for addition and multiplication operators. You have to be careful when you apply this law. Consider computing a*b*c, where a and c are very large numbers and b is very small so that a*b and b*c are close to 1. In this case, computing a*b*c would be fine.

Now let us exercise the commutative law by changing the expression to a*c*b. What would happen? You could get an overflow because a*c could be too large to be represented by the hardware!

Associative law does not work well either. The associative law states that one can evaluate a*b*c either by (a*b)*c or by a*(b*c). It looks innocent; but could be damaging.

Consider the following recurrence relations

where the value for R is 3.0 and the initial value for x is 0.5. These five recurrence relations are equivalent mathematically (you can verify it easily). If we compute the values of x up to the 1000th term and display the results every 1000 terms, the following contains shocking results:

    0           0.5           0.5           0.5           0.5           0.5
  100    1.6782e-05       1.30277      0.506627       1.14111      0.355481
  200       1.28383       0.97182       1.25677       1.04532      0.805295
  300      0.745989      0.155282       1.32845      0.295785     0.0590719
  400     0.0744125      0.354702       1.00514      0.882786      0.171617
  500      0.788592    0.00176679      0.804004      0.496569       1.32348
  600     0.0190271       0.35197      0.832942      0.518201       1.20776
  700      0.377182      0.525322       0.31786      0.435079      0.525364
  800        0.1293   0.000277146       1.33159     0.0130949      0.954542
  900      0.375104    0.00743761       1.27548      0.409476    0.00996984
 1000      0.680855       1.03939      0.462035     0.0734742       1.26296
All five ways of using associative law give total different behavior and this happens very early: the 100th terms are very different!

Click here to download a C program of this example.

We shall return to this topic with an eye on the impact on geometric problems.