RE: Type theory vs floating-point arithmetic


I work as the "Java Floating-Point Czar" at Sun and I'm acting editor
for the IEEE 754 revision committee.  A number of colleagues forwarded
Mr. Sweeney's message to me and I have a few comments in response.

 > -----Original Message-----
 > From: Tim Sweeney [mailto:tim@epicgames.com]
 > Sent: Friday, August 22, 2003 11:57 PM
 > To: types@cis.upenn.edu
 > Subject: Type theory vs floating-point arithmetic
 > [----- The Types Forum, http://www.cis.upenn.edu/~bcpierce/types -----]
 > I'm looking for pointers to attempts to reconcile the differences between
 > type theories and the data types and functions defined by the IEEE 754
 > floating-point standard.
 > For example, IEEE 754 defines equality over floating point numbers in 
a way
 > that conflicts with Leibniz equality.  There exist two IEEE "zeros", 
-0 and
 > +0, and the standard requires that they compare as if they are as equal,
 > though there exist functions that distinguish between them.  
 > IEEE defines a class of "not-a-number" values x such that x=x is deemed
 > false.

Yes, the IEEE 754 "==" operator does not define an equivalence
relation.  This is an unfortunate but necessary consequence driven by
other aspects of 754 design.  Floating-point arithmetic is a
systematic approximation to real arithmetic.  Besides only being able
to represent a finite subset of the real numbers, floating-point
arithmetic can only preserve a subset of the properties of real

Briefly, IEEE 754 floating-point arithmetic is a complete system; that
is, for all inputs, every arithmetic operation in the standard has a
defined IEEE 754 value as a result.  One benefit of this design is
allowing a computation to continue until a point where it is
convenient to check for "errors;" e.g. all divisions do not have to be
guarded by checks for a 0 divisor.  A finite value divided by 0 yields
a signed infinity; the mathematically undefined value 0/0 (lim x->0
x/x = 1; lim x->0 0/x = 0) yields NaN, not a number.  NaN is used as a
result for "invalid" operations that usually are not mathematically
defined; however, the rules for NaN generation and propagation are
well-defined by IEEE 754.  Since NaN values don't correspond to
numbers, they are not ordered with respect to other values.  NaN not
being equal to itself helps prevent NaN from falling through
conditional checks unnoticed.

There are two ways to add infinite values to the field of real

1.  a single projective infinity; i.e. the reals extended in this way
form a topological circle with infinity across from zero

2. two affine infinities with

-infinity < any finite value < +infinity

The reals extended in this way still form a line.

The final version of IEEE 754 choose to only support the latter
option; the original 8087 fpu and early 754 drafts supported both.
With affine infinities, signed zeros allow multiply and divide to use
the same rules for the sign of any non-NaN result, the XOR of the
operands' signs.  For example,

1.0/+infinity => +0.0
1.0/-infinity => -0.0

1.0/+0.0      => +infinity
1.0/-0.0      => -infinity

Dividing a finite result by zero is the only case where the sign of
zero affects the magnitude of the result; usually the sign of a zero
operand at most determines the sign of a zero result.  For example,

1.0 + +0.0    => 1.0
1.0 + -0.0    => 1.0

1.0 * +0.0    => +0.0
1.0 * -0.0    => -0.0

Both zeros correspond to the same real number zero; the sign of a zero
value records some information about the history of how that value was
computed.  This information can be used to create more consistent
complex math library functions; see

  W. Kahan, "Branch Cuts for Complex Elementary Functions, or Much Ado
  About Nothing's Sign Bit" in The State of the Art in Numerical
  Analysis, (eds. Iserles and Powell), Clarendon Press, Oxford, 1987.

While the design of IEEE 754 floating-point arithmetic may not be
intuitive, it is not capricious either.

 > This would seem to require that a language either expose two equality
 > operators (a natively-supplied Leibniz equality operator, and a "IEEE 754
 > numerical equality" operator which differs at some values), or that the
 > language disobey the floating-point standard with regard to 
equality.  Java
 > takes the later route, while C/C++ and the Haskell report appears to be
 > silent on the issue.

Java does not violate the IEEE 754 standard on this point.  Java's
floating-point comparison operators (<, <=, ==, etc.) return the
true/false value mandated by IEEE 754; see "The Java Language
Specification, 2nd Edition" section 15.20.1.  For example

NaN == NaN

is false;

-0.0 == 0.0

is true; and

NaN < 5
NaN > 5
NaN == 5

are all false.

The Java library methods Float.compare and Double.compare impose a
total ordering on floating-point values; with the compare method,

-0.0 < +0.0

is true and NaN is equal to itself.  These compare semantics are used
to sort float and double arrays and I use those methods to write tests
of the math library.

 > There is a further problem regarding subtyping.  It might first 
appear that
 > the type of 32-bit floating point values is a subtype of the type of 
 > floating point values.  Though one can losslessly convert from 32-bit to
 > 64-bit and back, performing a 32-bit arithmetic operation on such 32-bit
 > numbers can produce a different result than performing a 64-bit 
 > operation on their 64-bit extensions, and then converting back.  So 
from a
 > type theoretic point of view, the types of 32-bit and 64-bit floating 
 > numbers must be disjoint.

The above claim that for float values a and b,

a op b

gets a different result than

(float)((double)a op (double) b)

where op is an element of {+,-,*,/} is *not* true.  For two IEEE 754
floating-point formats, if the narrower format has p bits of precision
and the wider format has at least 2p+2 bits of precision, operations
on the narrower format can be implemented in the wider format in this
manner and get *exactly* the same numerical answer.  In the case of
the 32-bit float and 64-bit double formats, float has 24 bits of
precision and double has 53 bits of precision; 53 > 2*24 + 2.  This
feature of the relation between float and double was known at the time
754 was designed since some members of the committee were interested
in minimizing the effort needed to support the required float format
by a mostly double fpu.  A sketch of the proof of the 2p+2 result can
be found in David Goldberg's Appendix A on Computer Arithmetic in
Hennessy and Patterson "Computer Architectures: A Quantitative
Approach, 2nd edition."  The easiest operation to consider is
multiply.  The exact product of two float numbers can only require up
to 48 bits of precision; therefore, the exact product can be stored as
a double value.  When converted to float, that exact double product
will round to the appropriate float value, just as a true float
multiply would.  This follows from the fundamental simplicity of IEEE
754; from section 5 of that standard

  ... each of the [arithmetic] operations shall be performed as if
  first produced an intermediate result correct to infinite precision
  and with unbounded range, and then coerced this intermediate result
  to fit in the destination's format.

 > Finally, IEEE 754 mandates that a compliant computer architecture and
 > language pair expose a mechanism for reading and writing a set of "sticky
 > flags" which indicate whether any imprecise, overflowed, or invalid
 > operations have been encountered since the flags were last reset; and 
a set
 > of "rounding modes" affecting subsequent operations.  These concepts seem
 > incompatible with referential transparency and lazy evaluation.

As a practical matter, there have been few 754 compliant language
specifications or implementations and a large majority of numerical
codes don't use the oft-omitted features.  For a mainstream language,
C99 is very compliant.  However, I don't like all aspects of the C99
floating-point design; I have alternative ideas on how full IEEE 754
support could be added to a recent language: "Borneo 1.0.2: Adding
IEEE 754 floating point support to Java," http://www.jddarcy.org/Borneo.

If floating-point arithmetic is modeled by having the rounding mode as
an input to the floating-point operation, there is no issue with
referential transparency.  Likewise, the accumulated flag state can be
regarded as in input to a floating-point operation and the flag state
with any conditions raised by that operation ORed in can be regarded
as another output.  So instead of

op: double x double  -> double

you can model with

op: double x double x roundingmode x flags -> {double, flags}

Yes, this is rather more awkward; but in theory, if not in practice,
it finesses the referential transparency issue.

The paper

  David Goldberg, "The design of floating-point data types," ACM
  Letters on Programming Languages and Systems (LOPLAS), Volume 1
  Issue 2 June 1992.

includes a discussion of some numerical aspects of floating-point
types, including 754 issues.

One weakness of the original IEEE 754 standard document is that no
rationale for the design is included or referenced (although rationales
were published during the standard's development) and no guidance is
given for programming language designers.  For the 754 revision, the
committee hopes to address some of those shortcomings.

-Joe Darcy