Marathoncode: Floating-point arithmetic with arbitrary

sexta-feira, 13 de abril de 2012

Floating-point arithmetic with arbitrary

1. Motivation
The floating-point arithmetic used in computers is prone to some errors related to the way that real numbers are represented (binary base) and also related to the size of the representation (32 bits or 64 bits). These two limiting factors often cause errors in precision calculations performed with floating point numbers.

As in the decimal base, certain real numbers has an infinite representation of digits. Even some numbers with finite decimal base representation can have an infinite binary base representation. As an example, the decimal representation is 0.1 is 0.0(0011) in binary base representation. The numbers in brackets is repeated indefinitely. Moreover, the decimal representation of 0.125 is finite in the floating-point binary base 0.001.

A floating point number is stored as follows (+ | -) s * 2 ^ e. The s represents the sginificante. The significant can have 24 or 53 binary digits.

For example, the number 0.1 is represented in the computer as follows:

1.1001100110011001100110011001100110011001100110011010 * 2 ^ -4

This representation will result in noticeable errors in simple calculations:

Consider the following C program:

double x = 1.0; 
  for (i = 0, i <100; i + +) { 
  x = x - 0.01; 
  }  
   printf ("% .15 lf \ n", x);

The output of this program will be:
-0.000000000000001

The idea presented in [1] is to calculate precision error associated with an operation to store it and spread it with the calculation result. Thus, the result of an operation becomes not only a floating-point and becomes an expansion e = en + en-1 + ... + e2 + e1, where each component ei is a floating-point number.

The addition operation of two floating point numbers is as follows:

# Define Fast_Two_Sum_Tail (a, b, x, y) \
 bvirt = x - a, \
 y = b - bvirt
 # Define Fast_Two_Sum (a, b, x, y) \
 x = (REAL) (a + b) \
 Fast_Two_Sum_Tail (a, b, x, y)

  The expansion and generated by the addition of a and b is x + y.
   alg1

Figure extracted doctoral qualification Markos Oliveira Freitas

The sum of expansion with a floating-point number is as follows:

alg2

Figure extracted doctoral qualification Markos Oliveira Freitas

Using this idea, the algorithm is as follows:

int grow_expansion (elen, e, b, h) / * and Can Be and h being the same.  * / 
  int elen / / size of the expansion 
  REAL * and / / expansion 
  REAL b / / floating-point add 
  REAL * h / / output expansion

and [0] = 1.0; 
  h [0] = 0.0; 
  for (i = 0, i <100; i + +) { 
  elen = grow_expansion (elen, e, - 0.01, h); 
  expansioncopy (elen, and h); 
  } 
    printf ("% .15 lf \ n", estimate (elen, e));

The result of the program will

0.000000000000000

The C code with the operations can be find here .

References
[1] JR Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates . Computational Geometry & Discrete, 18 (3).

Nenhum comentário:

Postar um comentário