IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?
Assume that `t`, `a`, `b` are all double (IEEE Std 754) variables, and that neither `a` nor `b` is NaN (though either may be Inf). After `t = a - b`, do I necessarily have `a == b + t`?
Absolutely not. One obvious case is `a = DBL_MAX`, `b = -DBL_MAX`. Then `t = INFINITY`, so `b + t` is also `INFINITY`, which does not compare equal to `a`.
What may be more surprising is that there are cases where this happens without any overflow. Basically, they are all of the form where `a - b` is inexact. For example, if `a` is `DBL_EPSILON/4` and `b` is `-1`, then `a - b` is 1 (assuming the default rounding mode), and `a - b + b` is then 0, not `a`.
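The reason I mention this second example is that this is the canonical way of forcing rounding to a particular precision in IEEE arithmetic. For instance, if you have a number in the range [0,1) and want to force rounding it to a fixed number of fractional bits, you would add and then subtract a large power of two such as `0x1p49`. Doubles in [2^49, 2^50) are spaced 2^-3 apart, so with that constant the result is the nearest multiple of 1/8, i.e. 3 fractional bits; 4 fractional bits would need `0x1p48`.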