是否为IEEE 754浮点数加倍z = x-y保证z + y == x? [英] Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

查看：99 发布时间：2020/11/8 21:42:07 c# floating-point ieee-754

本文介绍了是否为IEEE 754浮点数加倍z = x-y保证z + y == x?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个问题，可以简化为以下问题陈述:

I have a problem that can be reduced to this problem statement:

给出一系列的双打，每个双打的范围都在[0, 1e7]范围内，修改最后一个元素，使数字的总和等于确切地是目标数字.系列的双打已经加起来目标数字在epsilon(1e-7)内，但不是==.

Given a series of doubles where each is in the range [0, 1e7], modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==.

以下代码有效，但是否可以保证满足第一句中所述要求的所有输入有效?

The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?

public static double[] FixIt(double[] input, double targetDouble)
{
    var result = new double[input.Length];
    if (input.Length == 0) return result;

    double sum = 0;
    for (int i = 0; i < input.Length - 1; i++)
    {
        sum += input[i];
        result[i] = input[i];
    }

    double remainder = targetDouble - sum;
    result[result.Length - 1] = remainder;
    return result;
}

var arr1 = Enumerable.Repeat(Math.PI / 13, 13).ToArray();
var arr2 = FixIt(arr1, Math.PI);

Debug.Print(Math.PI.ToString("R")); //3.1415926535897931
Debug.Print(arr1.Sum().ToString("R")); //3.1415926535897922
Debug.Print(arr2.Sum().ToString("R")); //3.1415926535897931

此问题的先前版本询问是否要修改第一个元素，但修改最后一个元素会将问题简化为已知的总和和已知的目标，剩下的问题仅是last = target-sum是否暗示sum+last == target的问题.

A previous version of this question asked about modifying the first element, but modifying the last element simplifies the problem to a known sum and a known target, leaving us with just the question of whether last = target-sum implies that sum+last == target.

(当然没有NaN，并且对范围的限制也暗示对last的一些限制，这可能会有所帮助.)

(Without NaN of course, and the restrictions on ranges imply some restrictions on last as well that might help.)

_{关于真正的问题:我们已经以多种方式使这个问题浮出水面，但是我们目前正在尝试做的是减少浮点误差，该浮点误差是由于数值而产生的.线性规划求解器(Coin-OR CBC)中的不稳定性.例如，有6个变量都必须在[0，X]范围内，并且变量的总和也必须为X.由于数值的不稳定性，求解器有时会返回略微为负的值以及未求和的值不能精确求和X.我们已经解决了负数问题-现在只是想解决X问题的总和. (是的，可能有些约束因我们更改结果而无法遵守，但确保将这些数字总和为X的优先级更高，而其他约束并不那么重要.)}

推荐答案

z = x-y;不保证z+y == x，并且对于找到z这样的z+y == x的问题并不总是有解决方案.一个证明.

z = x-y; does not guarantee z+y == x, and there is not always a solution for the problem of finding a z such z+y == x. A proof follows.

我们假定IEEE-754二进制浮点算法的舍入为最接近的值，并为偶数.使用了基本的64位格式，但结果适用于其他格式.请注意，64位格式使用53位有效数字，这意味着只能表示有效位数为53或更少的数字.

We assume IEEE-754 binary floating-point arithmetic with rounding to nearest, ties to even. The basic 64-bit format is used, but the result holds for other formats. Note that the 64-bit format uses 53-bit significands, meaning that only numbers with 53 or fewer significant binary digits can be represented.

考虑目标x等于1 + 2 ^-52.设y为2 ⁻⁵³.然后，在z = x-y;之后，z+y == x评估为false.算术详细信息如下所示，但是:

Consider a target x equal to 1+2⁻⁵². Let y be 2⁻⁵³. Then, after z = x-y;, z+y == x evaluates to false. The arithmetic details are shown below, but:

z = x-y;将z设置为1，然后z+y产生1，该数字小于x.
如果我们将z增加到下一个可表示的数字1 + 2 ⁻⁵²，则z+y会产生1 + 2 ⁻⁵¹，大于x.
因此，没有z的值使z+y == x为真.

z = x-y; sets z to 1, and then z+y produces 1, which is less than x.
If we increase z to the next representable number, 1+2⁻⁵², then z+y produces 1+2⁻⁵¹, which is greater than x.
So there is no value of z that makes z+y == x true.

详细信息:

x-y的数学结果为1 + 2 ^-53.由于它具有54个有效位(从2 ⁰到2 ⁻⁵³)，因此无法表示，并且x-y的计算结果必须四舍五入.两个最接近的数字是1和1 + 2 ⁻⁵².平分关系规则产生前一个数字1，因为其有效位数的低位为0，而1 + 2 ^-52的低位为1.

The mathematical result of x−y is 1+2⁻⁵³. As this has 54 significant bits (from 2⁰ to 2⁻⁵³), it is not representable, and the computed result of x-y must be rounded. The two nearest numbers are 1 and 1+2⁻⁵². The ties-to-even rule produces the former number, 1, as the low bit of its significand is 0, while the low bit for 1+2⁻⁵² is 1.

因此z = x-y;将z设置为1.

则z + y的数学结果为1 + 2 ⁻⁵³.如上所述，将其舍入为1，因此z+y的计算结果为1.因此z+y == x将1与1 + 2 ^-52进行比较，并生成false.

Then the mathematical result of z+y is 1+2⁻⁵³. As above, this is rounded to 1, so the computed result of z+y is 1. So z+y == x compares 1 to 1+2⁻⁵² and produces false.

此外，z的任何值都不能使比较正确.如果我们以最小的可用步长将z从1增加到1 + 2 ^-52，则z + y的数学和就是1 + 2 ^-52 +2 ⁻⁵³.这在两个可表示的数字1 + 2 ^-52和1 + 2 ^-51之间.前者的低位为1，后者的低位为0，因此此z+y的计算结果为1 + 2 ⁻⁵¹，当然不等于1 +2 ⁻⁵².

Furthermore, no value of z could make the comparison true. If we increment z by the smallest available step, from 1 to 1+2⁻⁵², the mathematical sum of z+y is then 1+2⁻⁵²+2⁻⁵³. This is midway between the two representable numbers 1+2⁻⁵² and 1+2⁻⁵¹. The former has a low bit of 1, and the latter has a low bit of 0, so the computed result of this z+y is 1+2⁻⁵¹, which is of course not equal to 1+2⁻⁵².

浮点加法是弱单调的，因此没有z的值会为z+y产生1 + 2 ^-52.

Floating-point addition is weakly monotonic, so there are no values of z that would produce 1+2⁻⁵² for z+y.

这篇关于是否为IEEE 754浮点数加倍z = x-y保证z + y == x?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

是否为IEEE 754浮点数加倍z = x-y保证z + y == x? [英] Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录关闭

是否为IEEE 754浮点数加倍z = x-y保证z + y == x? [英] Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录 关闭

登录关闭