Floating point comparison revisited


Question



    This topic has come up many times on StackOverflow, but I believe this is a new take. Yes, I have read Bruce Dawson's articles, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", and this nice answer.

    As I understand it, on a typical system there are four basic problems when comparing floating-point numbers for equality:

    1. Floating point calculations are not exact
    2. Whether a-b is "small" depends on the scale of a and b
    3. Whether a-b is "small" depends on the type of a and b (e.g. float, double, long double)
    4. Floating point typically has +-infinity, NaN, and denormalized representations, any of which can interfere with a naïve formulation

    This answer -- aka. "the Google approach" -- seems to be popular. It does handle all of the tricky cases. And it does scale the comparison very precisely, checking whether two values are within a fixed number of ULPs of each other. Thus, for example, a very large number compares "almost equal" to infinity.
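The bit-twiddling idea behind that approach can be sketched briefly. This is my own illustrative sketch (the names `orderedBits` and `withinUlps` are made up, not from any library), for `float` only, assuming IEEE 754 and two's complement; it uses `std::memcpy` rather than the union trick to avoid type-punning trouble:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float's bit pattern to an integer whose ordering matches the
// floating-point ordering; then ULP distance is just integer distance.
inline std::int32_t orderedBits(float f)
{
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);  // well-defined copy of the representation
    // Negative floats sort backwards as sign-magnitude integers; remap them.
    return (i < 0) ? INT32_MIN - i : i;
}

inline bool withinUlps(float a, float b, std::int32_t maxUlps)
{
    if (std::isnan(a) || std::isnan(b))
        return false;  // NaN compares equal to nothing
    std::int64_t d = std::int64_t(orderedBits(a)) - orderedBits(b);
    return (d < 0 ? -d : d) <= maxUlps;
}
```

Note that `orderedBits` maps both +0.0f and -0.0f to 0, so signed zeros compare equal at 0 ULP, and the inherent reliance on the representation is exactly the portability complaint raised above.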

    However:

    • It is very messy, in my opinion.
    • It is not particularly portable, relying heavily on internal representations, using a union to read the bits from a float, etc.
    • It only handles single-precision and double-precision IEEE 754 (in particular, no x86 long double)

    I want something similar, but using standard C++ and handling long doubles. By "standard", I mean C++03 if possible and C++11 if necessary.

    Here is my attempt.

    #include <cmath>
    #include <limits>
    #include <algorithm>
    
    namespace {
    // Local version of frexp() that handles infinities specially.
    template<typename T>
    T my_frexp(const T num, int *exp)
    {
        typedef std::numeric_limits<T> limits;
    
        // Treat +-infinity as +-(2^max_exponent).
        if (std::abs(num) > limits::max())
        {
            *exp = limits::max_exponent + 1;
            return std::copysign(0.5, num);
        }
        else return std::frexp(num, exp);
    }
    }
    
    template<typename T>
    bool almostEqual(const T a, const T b, const unsigned ulps=4)
    {
        // Handle NaN.
        if (std::isnan(a) || std::isnan(b))
            return false;
    
        typedef std::numeric_limits<T> limits;
    
        // Handle very small and exactly equal values.
        if (std::abs(a-b) <= ulps * limits::denorm_min())
            return true;
    
        // frexp() does the wrong thing for zero.  But if we get this far
        // and either number is zero, then the other is too big, so just
        // handle that now.
        if (a == 0 || b == 0)
            return false;
    
        // Break the numbers into significand and exponent, sorting them by
        // exponent.
        int min_exp, max_exp;
        T min_frac = my_frexp(a, &min_exp);
        T max_frac = my_frexp(b, &max_exp);
        if (min_exp > max_exp)
        {
            std::swap(min_frac, max_frac);
            std::swap(min_exp, max_exp);
        }
    
        // Convert the smaller to the scale of the larger by adjusting its
        // significand.
        const T scaled_min_frac = std::ldexp(min_frac, min_exp-max_exp);
    
        // Since the significands are now in the same scale, and the larger
        // is in the range [0.5, 1), 1 ulp is just epsilon/2.
        return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon() / 2;
    }
    

    I claim that this code (a) handles all of the relevant cases, (b) does the same thing as the Google implementation for IEEE-754 single- and double-precision, and (c) is perfectly standard C++.

    One or more of these claims is almost certainly wrong. I will accept any answer that demonstrates such, preferably with a fix. A good answer should include one or more of:

    • Specific inputs differing by more than ulps Units in Last Place, but for which this function returns true (the bigger the difference, the better)
    • Specific inputs differing by up to ulps Units in Last Place, but for which this function returns false (the smaller the difference, the better)
    • Any case(s) I have missed
    • Any way in which this code relies on undefined behavior or breaks depending on implementation-defined behavior. (If possible, please cite a relevant spec.)
    • Fixes for whatever problem(s) you identify
    • Any way to simplify the code without breaking it

    I intend to place a non-trivial bounty on this question.

    Solution

    "Almost Equals" Is Not a Good Function

    4 is not an appropriate value: The answer you point to states "Therefore, 4 should be enough for ordinary use" but contains no basis for that claim. In fact, there are ordinary situations in which numbers calculated in floating-point by different means may differ by many ULP even though they would be equal if calculated by exact mathematics. Therefore, there should be no default value for the tolerance; each user should be required to supply their own, hopefully based on thorough analysis of their code.

    As an example of why a default of 4 ULP is bad, consider 1./49*49-1. The mathematically exact result is 0, but the computed result (64-bit IEEE 754 binary) is -0x1p-53, an error exceeding 1e307 ULP of the exact result and almost 1e16 ULP of the computed result.
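That example is easy to reproduce. A minimal check (the helper name `residual49` is mine), assuming IEEE 754 binary64 arithmetic:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns the computed value of 1./49*49 - 1. Mathematically this is 0, but
// both the division and the multiplication round; as noted above, the
// computed result is -0x1p-53, about -1.1e-16.
inline double residual49()
{
    return 1. / 49 * 49 - 1;
}
```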

    Sometimes, no value is appropriate: In some cases, the tolerance cannot be relative to the values being compared, neither a mathematically exact relative tolerance nor a quantized ULP tolerance. For example, nearly every output value in an FFT is affected by nearly every input value, and the error in any one element is related to the magnitude of other elements. An "almost equals" routine must be supplied additional context with information about the potential error.

    "Almost Equals" has poor mathematical properties: This shows one of the shortcomings of "almost equals": Scaling changes the results. The code below prints 1 and 0.

    double x0 = 1.1;
    double x1 = 1.1 + 3*0x1p-52;
    std::cout << almostEqual(x0, x1) << "\n";
    x0 *= .8;
    x1 *= .8;
    std::cout << almostEqual(x0, x1) << "\n";
    

    Another failing is that it is not transitive; almostEqual(a, b) and almostEqual(b, c) do not imply almostEqual(a, c).
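The non-transitivity is easiest to see with a plain absolute tolerance; `nearEq` below is an illustrative stand-in of my own, not the almostEqual from the question, but the same failure occurs with any fixed ULP tolerance:

```cpp
#include <cassert>
#include <cmath>

// A fixed-tolerance "almost equals" is not an equivalence relation:
// a ~ b and b ~ c do not imply a ~ c, because the differences accumulate.
inline bool nearEq(double a, double b, double tol)
{
    return std::fabs(a - b) <= tol;
}
```

For example, with tol = 1: nearEq(0, 1) and nearEq(1, 2) both hold, yet nearEq(0, 2) does not.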

    A Bug in Extreme Cases

    almostEqual(1.f, 1.f/11, 0x745d17) incorrectly returns 1.

    1.f/11 is 0x1.745d18p-4. Subtracting this from 1 (which is 0x10p-4) yields 0xe.8ba2e8p-4. Since an ULP of 1 is 0x1p-23, that is 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP (shifted 20 bits and divided by 2, netting 19 bits) = 0x745d17.4 ULP. That exceeds the specified tolerance of 0x745d17, so the correct answer would be 0.
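That arithmetic can be double-checked in `double`, where 1, 1.f/11, their difference, and the final scaling are all exact (assuming standard round-to-nearest `float` division):

```cpp
#include <cassert>
#include <cmath>

// Recompute the ULP count from the paragraph above in double precision.
// 1.f/11 is 0x1.745d18p-4; 1 - 0x1.745d18p-4 = 0xe.8ba2e8p-4 exactly in
// double (the result needs only 28 significant bits), and dividing by
// ULP(1.f) = 2^-23 is an exact power-of-two scaling.
inline double ulpGap()
{
    double diff = 1.0 - double(1.f / 11);
    return diff / std::ldexp(1.0, -23);   // expected: 0x745d17.4 = 7626007.25
}
```

The gap of 0x745d17.4 ULP indeed exceeds the tolerance 0x745d17.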

    This error is caused by rounding in max_frac-scaled_min_frac.

    An easy escape from this problem is to specify that ulps must be less than .5/limits::epsilon. Then rounding occurs in max_frac-scaled_min_frac only if the difference (even when rounded) exceeds ulps; if the difference is less than that, the subtraction is exact, by Sterbenz’ Lemma.
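For reference, Sterbenz' Lemma says that if b/2 <= a <= 2*b, then a - b is computed exactly. A rough illustration (`sterbenzExact` is my own helper; it compares against the same subtraction carried out in `long double`, so on platforms where `long double` is no wider than `double` the check is vacuous but still passes):

```cpp
#include <cassert>
#include <cmath>

// Check that the double subtraction a - b introduced no rounding, by
// repeating it in long double. For inputs in the Sterbenz range the
// double result is exact, so the two agree.
inline bool sterbenzExact(double a, double b)
{
    return static_cast<long double>(a) - b
        == static_cast<long double>(a - b);
}
```

By contrast, outside that range subtraction can round: 1.0 - 2^-60 rounds right back to 1.0.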

    There was a suggestion about using long double to correct this. However, long double would not correct this. Consider comparing 1 and -0x1p-149f with ulps set to 1/limits::epsilon. Unless your significand has 149 bits, the subtraction result rounds to 1, which is less than or equal to 1/limits::epsilon ULP. Yet the mathematical difference clearly exceeds 1.

    Minor Note

    The expression ulps * limits::epsilon() / 2 converts ulps to the floating-point type, which causes rounding errors for large values of ulps that are not exactly representable. Likely, the routine is not intended to be used with such large values (millions of ULPs in float), so this ought to be specified as a limit on the routine rather than a bug.
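A concrete instance of that conversion rounding (the helper name is mine):

```cpp
#include <cassert>

// 2^24 + 1 = 16777217 is the smallest positive integer a float cannot
// represent. Converting it rounds to nearest (ties to even), yielding
// 16777216.0f -- so a ulps argument this large would be silently distorted.
inline float roundedUlps()
{
    unsigned ulps = 16777217u;        // needs 25 significand bits
    return static_cast<float>(ulps);
}
```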
