
Re: Type theory vs floating-point arithmetic



Joe Darcy wrote:
> 
> Tim Sweeney wrote:
>  > There is a further problem regarding subtyping.  It might first appear that
>  > the type of 32-bit floating point values is a subtype of the type of 64-bit
>  > floating point values.  Though one can losslessly convert from 32-bit to
>  > 64-bit and back, performing a 32-bit arithmetic operation on such 32-bit
>  > numbers can produce a different result than performing a 64-bit arithmetic
>  > operation on their 64-bit extensions, and then converting back.  So from a
>  > type theoretic point of view, the types of 32-bit and 64-bit floating point
>  > numbers must be disjoint.
> 
> The above claim that for float values a and b,
> 
> a op b
> 
> gets a different result than
> 
> (float)((double)a op (double) b)
> 
> where op is an element of {+,-,*,/} is *not* true. 

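Indeed, Darcy's claim is easy to check empirically. Here is a small C
test harness (a sketch of my own, not Darcy's; it assumes IEEE 754
round-to-nearest and FLT_EVAL_METHOD == 0, i.e. no excess intermediate
precision). It reports zero mismatches:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw random positive floats with varied exponents. */
static float randf(void)
{
    return ldexpf((float)(rand() % 100000 + 1), rand() % 64 - 32);
}

int main(void)
{
    int i, mismatches = 0;
    srand(1);
    for (i = 0; i < 1000000; i++) {
        float a = randf(), b = randf();
        /* Each single operation, done in float, should agree with
           the same operation done in double and rounded back. */
        if (a + b != (float)((double)a + (double)b)) mismatches++;
        if (a - b != (float)((double)a - (double)b)) mismatches++;
        if (a * b != (float)((double)a * (double)b)) mismatches++;
        if (a / b != (float)((double)a / (double)b)) mismatches++;
    }
    printf("mismatches: %d\n", mismatches);   /* prints 0 */
    return 0;
}

(This is evidence, of course, not a proof; the underlying reason the
claim holds is that double's 53-bit significand exceeds the 2*24+2
bits needed to make this double rounding innocuous for +, -, *, /.)
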
It must be pointed out that Joe Darcy claims this only for a single
elementary operation. The analogous property does not hold for
composite operations; already

(a op b) op c

is always equal to:

(float)((double)(float)((double)a op (double)b) op (double)c),

but in general not to:

(float)(((double)a op (double)b) op (double)c).

The intermediate (double)(float) conversion cannot be omitted, because
the intermediate result is not in general representable as a float. 
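To make this concrete (again a sketch of my own, with op = +, assuming
IEEE 754 round-to-nearest and no excess intermediate precision): take
a = 1.0f and b = c = 2^-24. Each float addition faces the halfway
value 1 + 2^-24 and rounds it back down to 1 (round-to-even), while in
double the two small terms accumulate to 2^-23, which survives the
final rounding to float:

#include <stdio.h>

int main(void)
{
    float a = 1.0f, b = 0x1p-24f, c = 0x1p-24f;

    float f1 = (a + b) + c;              /* pure float */
    float f2 = (float)((double)(float)((double)a + (double)b)
                       + (double)c);     /* intermediate rounded back to float */
    float f3 = (float)(((double)a + (double)b)
                       + (double)c);     /* intermediate kept in double */

    printf("%.8f %.8f %.8f\n", f1, f2, f3);
    /* prints: 1.00000000 1.00000000 1.00000012 */
    return 0;
}

The first two results agree, by Darcy's claim applied twice, but the
third differs.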

So Tim Sweeney's original claim is still correct if one understands the
word "operation" to include composite operations (as one would, in the
context of subtyping). 

However, this particular feature of floating-point arithmetic is
clearly desirable; the only way around it would be to force each
elementary operation to return a float whenever both of its inputs are
representable as floats. This would be silly, as the whole point of
using higher-precision types for intermediate results is to get a more
accurate final answer.
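
For instance (one more sketch of my own, under the same IEEE 754
assumptions): summing a million copies of 2^-24 into 1.0 entirely in
float loses every single term to rounding, while accumulating in
double and rounding to float once at the end gives the accurate
answer:

#include <stdio.h>

int main(void)
{
    float  fsum = 1.0f;
    double dsum = 1.0;
    int i;

    for (i = 0; i < 1000000; i++) {
        fsum += 0x1p-24f;   /* each partial sum rounds back to 1.0 */
        dsum += 0x1p-24;    /* accumulates exactly in double       */
    }
    printf("float accumulation:  %.7f\n", fsum);          /* 1.0000000 */
    printf("double accumulation: %.7f\n", (float)dsum);   /* 1.0596046 */
    return 0;
}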

-- Peter