是否为IEEE 754浮点数加倍z = x-y保证z + y == x? [英] Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

查看:99
本文介绍了是否为IEEE 754浮点数加倍z = x-y保证z + y == x?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个问题,可以简化为以下问题陈述:

I have a problem that can be reduced to this problem statement:

给出一系列的双打,每个双打的范围都在[0, 1e7]范围内, 修改最后一个元素,使数字的总和等于 确切地是目标数字.系列的双打已经加起来 目标数字在epsilon(1e-7)内,但不是==.

Given a series of doubles where each is in the range [0, 1e7], modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==.


以下代码有效,但是否可以保证满足第一句中所述要求的所有输入有效?


The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?

public static double[] FixIt(double[] input, double targetDouble)
{
    var result = new double[input.Length];
    if (input.Length == 0) return result;

    double sum = 0;
    for (int i = 0; i < input.Length - 1; i++)
    {
        sum += input[i];
        result[i] = input[i];
    }

    double remainder = targetDouble - sum;
    result[result.Length - 1] = remainder;
    return result;
}

var arr1 = Enumerable.Repeat(Math.PI / 13, 13).ToArray();
var arr2 = FixIt(arr1, Math.PI);

Debug.Print(Math.PI.ToString("R")); //3.1415926535897931
Debug.Print(arr1.Sum().ToString("R")); //3.1415926535897922
Debug.Print(arr2.Sum().ToString("R")); //3.1415926535897931

此问题的先前版本询问是否要修改第一个元素,但修改最后一个元素会将问题简化为已知的总和和已知的目标,剩下的问题仅是last = target-sum是否暗示sum+last == target的问题.

A previous version of this question asked about modifying the first element, but modifying the last element simplifies the problem to a known sum and a known target, leaving us with just the question of whether last = target-sum implies that sum+last == target.

(当然没有NaN,并且对范围的限制也暗示对last的一些限制,这可能会有所帮助.)

(Without NaN of course, and the restrictions on ranges imply some restrictions on last as well that might help.)

关于真正的问题:我们已经以多种方式使这个问题浮出水面,但是我们目前正在尝试做的是减少浮点误差,该浮点误差是由于数值而产生的.线性规划求解器(Coin-OR CBC)中的不稳定性.例如,有6个变量都必须在[0,X]范围内,并且变量的总和也必须为X.由于数值的不稳定性,求解器有时会返回略微为负的值以及未求和的值不能精确求和X.我们已经解决了负数问题-现在只是想解决X问题的总和. (是的,可能有些约束因我们更改结果而无法遵守,但确保将这些数字总和为X的优先级更高,而其他约束并不那么重要.)

推荐答案

z = x-y;不保证z+y == x,并且对于找到z这样的z+y == x的问题并不总是有解决方案.一个证明.

z = x-y; does not guarantee z+y == x, and there is not always a solution for the problem of finding a z such z+y == x. A proof follows.

我们假定IEEE-754二进制浮点算法的舍入为最接近的值,并为偶数.使用了基本的64位格式,但结果适用于其他格式.请注意,64位格式使用53位有效数字,这意味着只能表示有效位数为53或更少的数字.

We assume IEEE-754 binary floating-point arithmetic with rounding to nearest, ties to even. The basic 64-bit format is used, but the result holds for other formats. Note that the 64-bit format uses 53-bit significands, meaning that only numbers with 53 or fewer significant binary digits can be represented.

考虑目标x等于1 + 2 -52 .设y为2 −53 .然后,在z = x-y;之后,z+y == x评估为false.算术详细信息如下所示,但是:

Consider a target x equal to 1+2−52. Let y be 2−53. Then, after z = x-y;, z+y == x evaluates to false. The arithmetic details are shown below, but:

  • z = x-y;z设置为1,然后z+y产生1,该数字小于x.
  • 如果我们将z增加到下一个可表示的数字1 + 2 −52 ,则z+y会产生1 + 2 −51 ,大于x.
  • 因此,没有z的值使z+y == x为真.
  • z = x-y; sets z to 1, and then z+y produces 1, which is less than x.
  • If we increase z to the next representable number, 1+2−52, then z+y produces 1+2−51, which is greater than x.
  • So there is no value of z that makes z+y == x true.

详细信息:

x-y的数学结果为1 + 2 -53 .由于它具有54个有效位(从2 0 到2 −53 ),因此无法表示,并且x-y的计算结果必须四舍五入.两个最接近的数字是1和1 + 2 −52 .平分关系规则产生前一个数字1,因为其有效位数的低位为0,而1 + 2 -52 的低位为1.

The mathematical result of xy is 1+2−53. As this has 54 significant bits (from 20 to 2−53), it is not representable, and the computed result of x-y must be rounded. The two nearest numbers are 1 and 1+2−52. The ties-to-even rule produces the former number, 1, as the low bit of its significand is 0, while the low bit for 1+2−52 is 1.

因此z = x-y;z设置为1.

z + y的数学结果为1 + 2 −53 .如上所述,将其舍入为1,因此z+y的计算结果为1.因此z+y == x将1与1 + 2 -52 进行比较,并生成false.

Then the mathematical result of z+y is 1+2−53. As above, this is rounded to 1, so the computed result of z+y is 1. So z+y == x compares 1 to 1+2−52 and produces false.

此外,z的任何值都不能使比较正确.如果我们以最小的可用步长将z从1增加到1 + 2 -52 ,则z + y的数学和就是1 + 2 -52 +2 −53 .这在两个可表示的数字1 + 2 -52 和1 + 2 -51 之间.前者的低位为1,后者的低位为0,因此此z+y的计算结果为1 + 2 −51 ,当然不等于1 +2 −52 .

Furthermore, no value of z could make the comparison true. If we increment z by the smallest available step, from 1 to 1+2−52, the mathematical sum of z+y is then 1+2−52+2−53. This is midway between the two representable numbers 1+2−52 and 1+2−51. The former has a low bit of 1, and the latter has a low bit of 0, so the computed result of this z+y is 1+2−51, which is of course not equal to 1+2−52.

浮点加法是弱单调的,因此没有z的值会为z+y产生1 + 2 -52 .

Floating-point addition is weakly monotonic, so there are no values of z that would produce 1+2−52 for z+y.

这篇关于是否为IEEE 754浮点数加倍z = x-y保证z + y == x?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆