是否为IEEE 754浮点数加倍z = x-y保证z + y == x? [英] Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?
问题描述
我有一个问题,可以简化为以下问题陈述:
I have a problem that can be reduced to this problem statement:
给出一系列的双打,每个双打的范围都在
[0, 1e7]
范围内, 修改最后一个元素,使数字的总和等于 确切地是目标数字.系列的双打已经加起来 目标数字在epsilon(1e-7)内,但不是==.
Given a series of doubles where each is in the range
[0, 1e7]
, modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==.
以下代码有效,但是否可以保证满足第一句中所述要求的所有输入有效?
The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?
public static double[] FixIt(double[] input, double targetDouble)
{
var result = new double[input.Length];
if (input.Length == 0) return result;
double sum = 0;
for (int i = 0; i < input.Length - 1; i++)
{
sum += input[i];
result[i] = input[i];
}
double remainder = targetDouble - sum;
result[result.Length - 1] = remainder;
return result;
}
var arr1 = Enumerable.Repeat(Math.PI / 13, 13).ToArray();
var arr2 = FixIt(arr1, Math.PI);
Debug.Print(Math.PI.ToString("R")); //3.1415926535897931
Debug.Print(arr1.Sum().ToString("R")); //3.1415926535897922
Debug.Print(arr2.Sum().ToString("R")); //3.1415926535897931
此问题的先前版本询问是否要修改第一个元素,但修改最后一个元素会将问题简化为已知的总和和已知的目标,剩下的问题仅是last = target-sum
是否暗示sum+last == target
的问题.
A previous version of this question asked about modifying the first element, but modifying the last element simplifies the problem to a known sum and a known target, leaving us with just the question of whether last = target-sum
implies that sum+last == target
.
(当然没有NaN,并且对范围的限制也暗示对last
的一些限制,这可能会有所帮助.)
(Without NaN of course, and the restrictions on ranges imply some restrictions on last
as well that might help.)
关于真正的问题:我们已经以多种方式使这个问题浮出水面,但是我们目前正在尝试做的是减少浮点误差,该浮点误差是由于数值而产生的.线性规划求解器(Coin-OR CBC)中的不稳定性.例如,有6个变量都必须在[0,X]范围内,并且变量的总和也必须为X.由于数值的不稳定性,求解器有时会返回略微为负的值以及未求和的值不能精确求和X.我们已经解决了负数问题-现在只是想解决X问题的总和. (是的,可能有些约束因我们更改结果而无法遵守,但确保将这些数字总和为X的优先级更高,而其他约束并不那么重要.)
推荐答案
z = x-y;
不保证z+y == x
,并且对于找到z
这样的z+y == x
的问题并不总是有解决方案.一个证明.
z = x-y;
does not guarantee z+y == x
, and there is not always a solution for the problem of finding a z
such z+y == x
. A proof follows.
我们假定IEEE-754二进制浮点算法的舍入为最接近的值,并为偶数.使用了基本的64位格式,但结果适用于其他格式.请注意,64位格式使用53位有效数字,这意味着只能表示有效位数为53或更少的数字.
We assume IEEE-754 binary floating-point arithmetic with rounding to nearest, ties to even. The basic 64-bit format is used, but the result holds for other formats. Note that the 64-bit format uses 53-bit significands, meaning that only numbers with 53 or fewer significant binary digits can be represented.
考虑目标x
等于1 + 2 -52 .设y
为2 −53 .然后,在z = x-y;
之后,z+y == x
评估为false.算术详细信息如下所示,但是:
Consider a target x
equal to 1+2−52. Let y
be 2−53. Then, after z = x-y;
, z+y == x
evaluates to false. The arithmetic details are shown below, but:
-
z = x-y;
将z
设置为1,然后z+y
产生1,该数字小于x
. - 如果我们将
z
增加到下一个可表示的数字1 + 2 −52 ,则z+y
会产生1 + 2 −51 ,大于x
. - 因此,没有
z
的值使z+y == x
为真.
z = x-y;
setsz
to 1, and thenz+y
produces 1, which is less thanx
.- If we increase
z
to the next representable number, 1+2−52, thenz+y
produces 1+2−51, which is greater thanx
. - So there is no value of
z
that makesz+y == x
true.
详细信息:
x
-y
的数学结果为1 + 2 -53 .由于它具有54个有效位(从2 0 到2 −53 ),因此无法表示,并且x-y
的计算结果必须四舍五入.两个最接近的数字是1和1 + 2 −52 .平分关系规则产生前一个数字1,因为其有效位数的低位为0,而1 + 2 -52 的低位为1.
The mathematical result of x
−y
is 1+2−53. As this has 54 significant bits (from 20 to 2−53), it is not representable, and the computed result of x-y
must be rounded. The two nearest numbers are 1 and 1+2−52. The ties-to-even rule produces the former number, 1, as the low bit of its significand is 0, while the low bit for 1+2−52 is 1.
因此z = x-y;
将z
设置为1.
则z
+ y
的数学结果为1 + 2 −53 .如上所述,将其舍入为1,因此z+y
的计算结果为1.因此z+y == x
将1与1 + 2 -52 进行比较,并生成false.
Then the mathematical result of z
+y
is 1+2−53. As above, this is rounded to 1, so the computed result of z+y
is 1. So z+y == x
compares 1 to 1+2−52 and produces false.
此外,z
的任何值都不能使比较正确.如果我们以最小的可用步长将z
从1增加到1 + 2 -52 ,则z
+ y
的数学和就是1 + 2 -52 +2 −53 .这在两个可表示的数字1 + 2 -52 和1 + 2 -51 之间.前者的低位为1,后者的低位为0,因此此z+y
的计算结果为1 + 2 −51 ,当然不等于1 +2 −52 .
Furthermore, no value of z
could make the comparison true. If we increment z
by the smallest available step, from 1 to 1+2−52, the mathematical sum of z
+y
is then 1+2−52+2−53. This is midway between the two representable numbers 1+2−52 and 1+2−51. The former has a low bit of 1, and the latter has a low bit of 0, so the computed result of this z+y
is 1+2−51, which is of course not equal to 1+2−52.
浮点加法是弱单调的,因此没有z
的值会为z+y
产生1 + 2 -52 .
Floating-point addition is weakly monotonic, so there are no values of z
that would produce 1+2−52 for z+y
.
这篇关于是否为IEEE 754浮点数加倍z = x-y保证z + y == x?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!