Re: Type theory vs floating-point arithmetic
> Re: Type theory vs floating-point arithmetic
> Achim Jung <A.Jung@cs.bham.ac.uk>
> Tue, 02 Sep 2003 12:15:57 +0100 (BST)
> Joe.Darcy@Sun.COM, Matthias Blume <email@example.com>
> Mathias has raised an interesting point which deserves further emphasis.
(I'm not on the types mailing list so I didn't see the message from
> For floating point numbers there are two possible interpretations and
> are not always compatible with each other:
> The "point interpretation": The floating point numbers represent a finite
> set of rational numbers.
> The "interval interpretation": The floating point numbers represent a
> finite set of intervals which together cover the whole of the real line.
The interval interpretation is not really workable model for plain
floating-point arithmetic, as implied by some of your later points.
Interval arithmetics are distinct numerical systems which can be built
on top of a floating-point arithmetic. Instead of operating on
"points," interval arithmetic operates on intervals like "the real
numbers strictly between 2 and 3." The basic correctness property of
interval arithmetic is containment; that is, the result interval must
contain the true point result. A narrow result interval indicates the
corresponding point computation is accurate. A wide result interval
indicates the corresponding point computation *may* be inaccurate.
The rounding modes required by IEEE 754 were included to make easy
implementations of interval arithmetics feasible.
The model used in the IEEE 754 standard is that a floating-point value
represents the corresponding exact real value (NaNs have no
corresponding numerical value). The description of the semantics of
the 754 arithmetic operations assumes the inputs are exact; the
floating-point value closest to the exact numerical value of the
computation is chosen as the result. The rounding process (usually)
maps a real number to the nearest floating-point value; all other
information about the originating value is lost, unless perhaps you
have sticky flag information.
You can actually perform exact computations with floating-point
arithmetic; the standard's inexact flag was included to allow a
program to notice that a rounding error occurred. (Of course, in
general to reliably use such a flag, the decimal -> binary conversion
process would have to set the inexact flag appropriately too.)
I think the common belief that floating-point can be modeled by
intervals stems from the numerical analysis treatments where
floating-point is related to real arithmetic by means of lots of Greek
letters representing rounding errors. However, that mapping to reals
doesn't fully explain floating-point. Below I'll briefly discuss the
relation between floating-point and forward and backwards error from
> In the point interpretation it is easy to see what the definition of the
> various operations should be: You return the floating point number
> closest to the exact result. A corresponding formulation for the interval
> interpretation (having either the same or a different effect on the
> implementation) is not so easy.
That is why true interval arithmetics return an interval range as a
> The interval interpretation, on the other hand, retains some information
> about the transition from "real" real numbers to the finite set of
> floating point numbers.
An interpretation may try to include this information; but the
floating-point values by themselves have no such information.
> This is helpful in understanding some aspects of
> the behaviour of the latter. For example, we can say that 0
> interval [0,0], whereas +0 represents the interval (0,\epsilon] where
> \epsilon is half-way between 0 and the smallest positive number
> representable. In this sense, +0 has a very useful role to play: It is a
> number which is positive (and not equal to 0) but too small to represent
> any digits for it. Likewise, +infinity stands for the interval (M,\infty)
> where M is half a unit of precision above the largest representable
> So it, too, represents perfectly reasonable numbers. NaN, finally,
> represents (-\infty,\infty) and should be the result of trying to add
> -infinity to +infinity as the answer could be anything. Division by zero
> should raise an exception, and not return NaN.
Interval arithmetics use various representation tricks of this sort;
however, there are a plethora of potential interval options including:
* Closed interval
* Open intervals
* Various definitions for dividing by an interval containing zero
* Whether or not infinity can be an endpoint
* Whether a "number" can be a set of intervals
* Representation options:
* endpoints, [a, b]
* midpoint and range
I'm not aware of any interval scheme that clearly describes the set of
mathematical objects being represented and uses an IEEE 754-style
completion of real intervals (infinity and NaN values, etc.). There
is a vast interval literature and various combinations of the above
options have been explored.
> Both interpretations have their difficulties: The point interpretation is
> misleading because the number displayed to the user can be different to
> the number represented in the machine.
How a floating-point number is displayed is an entirely separate
issue, although a common source of misunderstanding. All finite
binary floating-point numbers have *exact* decimal representations if
you allow enough digits. You need several hundred digits for very
small double values. Since printing out several hundred digits is
unwieldy, languages specify fewer digits on output; Java uses the
Steele-White technique of (roughly) using the string with the fewest
number of decimal digits which will allow the original floating-point
value to be recovered. This lets the user see "0.1" for the binary
double floating-point value closes to 0.1; the exact decimal value of
that number is
Using this short form is convenient, if potentially misleading to the
uninitiated. My esthetics may be warped by years of working on
floating-point; but I am very fond the C99's hexadecimal
floating-point strings for some purposes. For example, the value
"3.0" can be written as
that is "1.8" as a hex string (i.e. 1.5 decimal) times 2 to the 1. The
double floating-point value next larger than 3.0 is
and the number next smaller is
I find constructing these bordering values is much easier in hex than
Many of the surprises related to decimal <-> binary conversion are
discussed in my talk
"What Everybody Using the Java(TM) Programming Language Should Know
About Floating-Point Arithmetic," http://java.sun.com/people/darcy/
One of these days, I'll write-up this material for publication.
> So, for example, if you type in
> 1E40 ("ten to the forty") then many users of calculators and computers
> expect this to be one of the set of floating point numbers but because of
> the conversion into binary, and because of limited precision, it can not
> be represented exactly in either single, double, or extended double
The number "0.1" can't be exactly represented by a binary
floating-point format no matter how precise. If a binary
floating-point format had enough precision, 1E40 could be exactly
represented as it is just an integer.
> The interval interpretation is problematic because, as I mentioned at the
> beginning, it is not preserved by the operations. In other words, the
> interval associated to the result floating point number is not (and does
> not in general contain) the set of possible answers from numbers in the
> input intervals. Even addition fails this criterion in almost all cases.
> The good news, however, is that the interval interpretation can be taken
> as the starting point for "exact real computation" but this goes beyond
> simple floating point. Look at Martin Escardo's homepage for more
> to relevant work, or check out the proceedings of "Computability and
> Complexity in Analysis".
> The tension between the two interpretations can be illustrated by the
> question "What is sin(1E40)?": For the point interpretation, the sine
> function is one of the pleasant ones, because the answer is always
> -1 and 1, and so can be computed with high precision. Unfortunately, I
> don't know of any arithmetic coprocessor which goes to the trouble of
> actually computing the exact result for big input numbers (but please
> correct me if I am wrong).
The fdlibm algorithms used by Java compute accurate sines for the full
double range, all the way out to Double.MAX_VALUE, 10^308 or so. See
K.C., Ng, "Argument Reduction for Huge Arguments: Good to the Last
for details of how this can be implemented. The fsin and fcos
instructions as implemented on most x86 processors have an interesting
systematic flaw for large inputs; I can provide details if anyone is
> Furthermore, given that 1E40 is not exactly
> represented (against appearances), it is certainly wrong to return any
> result apart from NaN.
If an exact decimal to binary conversion was required to deliver
non-NaN floating-point results, nearly all floating-point results
would be NaN!
> In the interval interpretation, the answer should be the interval [-1,1]
> because the input interval contains many periods of \pi (even for
> double precision). Unfortunately, this is not one of the intervals
> available in any floating point system and so in this interpretation,
> the answer should be NaN.
I disagree. The double value closest to 1E40 is some particular
numerical value; the value has a hex representation of
0x1.d6329f1c35ca5p132. That particular numerical value has a
well-defined sine, e.g. from the Taylor expansion of sine. Therefore,
there is clearly a "correct" value for that input. The two adjacent
double values bracketing 1E40 are
Printed using Java conventions,
The difference between these two values is greater than 2*pi so the
sin of this interval would be [-1, 1]. In Java, the sine of
0x1.d6329f1c35ca5p132 is 0.6467845884268344.
To describe how the result of a floating-point computation
approximates the result of the desired computation on real numbers,
numerical analysis use forwards and backwards error.
When considering forwards error, you assume the input is exact, and
then determine how close the returned floating-point result is to the
exact real number result for those inputs. So the fdlibm sin routine
mentioned above has a small forward error; the relative error is
around 10^-16, the approximate decimal precision of double.
For backwards error, you assume the *result* is exact and the
determine how much the *inputs* might be perturbed to achieve the
observed result. The fdlibm sin routine has a (very) small backwards
error; anther sin routine could have a priori reasonable backwards
error bounds but return an entirely "wrong" result, e.g. a random
value in [-1, 1].
Building on the work of Wilkinson and others, backwards error analysis
is often used in numerical linear algebra. For example, the
surprising stability of Gaussian elimination with partial pivoting is
phrased in terms of backwards error in the input matrix. While
backwards error is a very important concept, it can be abused. A
colleague at Sun has noted that "backwards error is how numerical
analysts justify giving wrong results" ;-) A good reference for these
sorts of topics is Higham's "Accuracy and Stability of Numerical
Algorithms," which discusses forwards and backwards error in much more
Numerical analysis is not just a study of compensating for
floating-point rounding errors; even if exact arithmetic is used,
numerical analysis is still necessary. Two sources of numerical error
in a computation are floating-point rounding errors and "truncation
error," error from not using more terms of an expansion. For example,
even if each term is computed and summed exactly, if you want to use
the Taylor series to compute sine you will need some way to know you
have added together enough terms to get as close as desired to the
right answer. (Credible sin/cos implementations use a process called
argument reduction to map each input value into a corresponding value
in a chosen reduced range, e.g. [-pi/4, pi/4]. On that reduced range
a low-degree polynomial can be used to evaluate the function
accurately. This polynomial will generally not be the Taylor
polynomial. For more discussion and references see Muller's
"Elementary Functions: Algorithms and Implementations.")
One common example to illustrate the hazards of roundoff error is
computing the value of a polynomial near a repeated root using the
non-factored form of a polynomial, e.g. (x-1)^7 as x^7 + c1*x^6 +
... + -1. Near the root, the returned result widely fluctuates
between positive and negative values while the real function being
modeled is smooth. Therefore, the floating-point function's
approximation to the smooth polynomial is non-ideal. However, that
function which jumps around near the root as a set of input-output
mappings is a perfectly valid function on its own. The "error" is
defined in relation to some external "correct" value; if you for some
reason want the jagged function, there is no error.