Comparing two sums of floating point values in C or C++

Problem description

    Assume you are given two sets of floating point variables implemented according to IEEE 754, meant to be treated as exact values calculated according to the formulae in the standard. All legal values are possible. The number of variables in a set may be any natural number.

    What would be a good way to compare the exact (in the mathematical sense) sums of the values represented by those variables? Due to the nature of the domain, the problem can easily be restated as comparing a single sum to zero. You can disregard the possible presence of NaNs or infinities, as it is irrelevant to the core problem. (Those values can be checked for easily and independently, and acted upon in a manner suiting the particular application of this problem.)

    A naive approach would be to simply sum and compare, or to sum the values of one set and subtract the values of the other.

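        #include <vector>

        // Naive comparison: sums each set in float and compares the rounded sums.
        bool compare(const std::vector<float>& lhs, const std::vector<float>& rhs)
        {
            float lSum = 0.0f;
            for (auto value : lhs)
            {
                lSum += value;  // every addition rounds, so the total depends on order
            }
            float rSum = 0.0f;
            for (auto value : rhs)
            {
                rSum += value;
            }

            return lSum < rSum;
        }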
    

    Quite obviously there are problems with the naive approach, as mentioned in various other questions regarding floating point arithmetic. Most of the problems are related to two difficulties:

    • the result of adding floating point values differs depending on the order
    • certain orders of addition of certain sets of values may result in intermediate overflow (an intermediate result of the calculation goes beyond the range supported by the available data type)

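      // Needs <cstdlib> for strtof and <iostream> for the output.
      float small = strtof("0x1.0p-126", NULL);
      float big = strtof("0x1.8p126", NULL);
      
      // small is absorbed when added to big, so it is lost before big is subtracted again
      std::cout << std::hexfloat << small + big - big << std::endl;
      // the first three terms each round to big, and big + big + big exceeds the float range
      std::cout << std::hexfloat << (big-2*small) + (big-small) + big - (big+small) - (big+2*small) << std::endl;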
      

      This code prints 0 and inf, which illustrates how ordering affects the result and, hopefully, also that the problem of ordering is non-trivial.

      float prev;
      float curr = 0.0f;
      
      // Keep adding the smallest normal float until the sum stops changing.
      do
      {
          prev = curr;
          curr += strtof("0x1.0p-126", NULL);
      } while (prev != curr);
      
      std::cout << std::hexfloat << curr << std::endl;
      

    This code, given sufficient time to actually finish computing, results in 0x1.000000p-102, not, as could be naively expected, 0x1.fffffep127. (Changing the initialization of curr to `strtof("0x1.fff000p-103", NULL)` is advised to actually observe this.) Once curr reaches 2^-102, the gap to the next float is 2^-125, so under the default round-to-nearest-even mode adding 2^-126 lands exactly halfway and rounds back to curr. This illustrates how the proportion between the intermediate results of addition and particular addends affects the result.

    A lot has been said about obtaining the best precision, e.g. in this question.

    The problem at hand differs in that we do not want to maximize precision; we have a well-defined function that needs to be implemented exactly.

    While for some the idea that this may be a useful exercise seems controversial at best, consider the following scenario: the comparison between those value sets could be a cornerstone of other operations performed on entire datasets independently in various environments. The synchronized, flawless operation of some systems may depend on this comparison being well defined and deterministically implemented, regardless of the order of the addends and of the particular architecture, whether or not it implements IEEE 754.

    This, or just curiosity.

    In the discussion, the Kahan summation algorithm has been mentioned as relevant. However, this algorithm is only a reasonable attempt at minimizing error. It neither guarantees the correct sign of the result, nor is it independent of the order of operations (which would at least guarantee a consistent, if wrong, result for permutations of the sets).
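
To make the limitation concrete, here is a minimal sketch of textbook Kahan summation. The function name `kahan_sum` and the use of `double` are choices made for this illustration, not something taken from the discussion, and the behavior described assumes plain IEEE 754 double arithmetic without fast-math style optimizations.

```cpp
#include <vector>

// Textbook Kahan (compensated) summation. kahan_sum is a hypothetical name
// used only for this illustration.
double kahan_sum(const std::vector<double>& values)
{
    double sum = 0.0;
    double compensation = 0.0;  // running estimate of the low-order bits lost so far
    for (double x : values)
    {
        double y = x - compensation;     // apply the previous correction
        double t = sum + y;              // low-order bits of y may be lost here
        compensation = (t - sum) - y;    // recover what was lost (if |sum| >= |y|)
        sum = t;
    }
    return sum;
}
```

For the values {1.0, 1e100, 1.0, -1e100} the exact sum is 2, yet this sketch returns 0, and for the permutation {1e100, 1.0, -1e100, 1.0} it returns 1. The result is therefore neither exact nor independent of order, which matches the objection above.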

    One of the most obvious solutions would be to employ or implement fixed point arithmetic using a sufficient number of bits to represent every possible operand value exactly and to keep an exact intermediate result.
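
The fixed point idea can be sketched as follows for `float`. Everything here (`FixedAccumulator`, `add_magnitude`, `exact_less`, the 384-bit width) is a hypothetical illustration rather than code from the question, and it assumes IEEE 754 binary32 floats and finite inputs. Each magnitude is stored as an integer count of units of 2^-149, the smallest positive float; a single float needs at most 277 bits in that unit, and the rest is headroom for carries.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes IEEE 754 binary32 floats");

// Unsigned wide integer counting units of 2^-149 (hypothetical helper type).
struct FixedAccumulator
{
    std::array<std::uint32_t, 12> limbs{};  // little-endian, 384 bits in total

    // Adds |x| exactly. Assumes x is finite (no NaN, no infinity).
    void add_magnitude(float x)
    {
        std::uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        const std::uint32_t exponent = (bits >> 23) & 0xFFu;
        std::uint64_t mantissa = bits & 0x7FFFFFu;
        unsigned shift = 0;               // subnormals are already in units of 2^-149
        if (exponent != 0)
        {
            mantissa |= 0x800000u;        // restore the implicit leading bit
            shift = exponent - 1;         // |x| = mantissa * 2^(exponent - 1) units
        }
        std::uint64_t carry = mantissa << (shift % 32);
        for (std::size_t i = shift / 32; carry != 0 && i < limbs.size(); ++i)
        {
            carry += limbs[i];
            limbs[i] = static_cast<std::uint32_t>(carry);
            carry >>= 32;
        }
    }
};

// True when the exact sum of lhs is less than the exact sum of rhs.
bool exact_less(const std::vector<float>& lhs, const std::vector<float>& rhs)
{
    FixedAccumulator up, down;  // magnitudes that raise / lower sum(lhs) - sum(rhs)
    for (float v : lhs) (std::signbit(v) ? down : up).add_magnitude(v);
    for (float v : rhs) (std::signbit(v) ? up : down).add_magnitude(v);
    for (std::size_t i = up.limbs.size(); i-- > 0;)  // most significant limb first
    {
        if (up.limbs[i] != down.limbs[i]) return up.limbs[i] < down.limbs[i];
    }
    return false;  // the exact sums are equal
}
```

Because only integer additions are involved, the result does not depend on the order of the addends, and sums such as big + big + big from the example above cannot overflow. Keeping two unsigned accumulators avoids signed wide arithmetic; a single two's complement accumulator would work as well.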

    Perhaps, however, this can be done using only floating point arithmetic in a manner that guarantees the correct sign of the result. If so, the problem of overflow (as illustrated in one of the examples above) needs to be addressed in the solution, as this question has a particular technical aspect.
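
One building block that floating-point-only approaches commonly rely on is an error-free transformation such as Knuth's TwoSum, sketched below. The name `two_sum` is hypothetical, and the sketch assumes round-to-nearest arithmetic evaluated in `float` precision with no fast-math optimizations. It does not address overflow: if `a + b` overflows, the identity no longer holds.

```cpp
#include <utility>

// Knuth's TwoSum: returns {s, e} where s is the rounded sum a + b and
// s + e equals a + b exactly, provided a + b does not overflow.
std::pair<float, float> two_sum(float a, float b)
{
    float s = a + b;
    float b_virtual = s - a;                          // the part of b that made it into s
    float e = (a - (s - b_virtual)) + (b - b_virtual);  // the rounding error of a + b
    return {s, e};
}
```

For example, `two_sum(16777216.0f, 1.0f)` yields s = 16777216 and e = 1: the addend that the rounded sum drops is preserved in the error term instead of being lost.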

    (What follows is the original question.)

    I have two sets of multiple floating point (float or double) values. I want to provide a perfect answer to the question of which set has the larger sum. Because of artifacts in floating point arithmetic, in some corner cases the result of the naive approach may be wrong, depending on the order of operations. Not to mention that a simple sum can result in overflow. I can't show any effort on my side, because all I have is vague ideas, all of them complicated and not convincing.

    Solution

    One possible approach is to compute the sum using a superaccumulator: this is an algorithm for computing exact sums of floating point numbers. Although these ideas have been around for a while, the term is a relatively new one.

    In some sense, you can think of it as an extension of Kahan summation, where the sequential sum is stored as an array of values, rather than just a pair. The main challenge then becomes figuring out how to allocate the precision amongst the various values.
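
The "array of values" idea can be illustrated with a simpler, related technique: Shewchuk-style non-overlapping partial sums. This is a swapped-in sketch, not the superaccumulator algorithm from the papers listed here; the name `exact_sum_sign` is hypothetical, and the code assumes finite inputs, round-to-nearest `double` arithmetic, and no intermediate overflow.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Returns -1, 0 or +1: the sign of the exact sum of the inputs.
int exact_sum_sign(const std::vector<double>& values)
{
    // Nonzero, non-overlapping partial sums in increasing order of magnitude;
    // together they represent the exact sum of everything added so far.
    std::vector<double> partials;
    for (double x : values)
    {
        std::size_t kept = 0;
        for (std::size_t j = 0; j < partials.size(); ++j)
        {
            double y = partials[j];
            if (std::fabs(x) < std::fabs(y)) std::swap(x, y);
            double hi = x + y;          // rounded sum
            double lo = y - (hi - x);   // exact rounding error, valid since |x| >= |y|
            if (lo != 0.0) partials[kept++] = lo;
            x = hi;
        }
        partials.resize(kept);
        if (x != 0.0) partials.push_back(x);
    }
    if (partials.empty()) return 0;
    // The largest partial dominates the sum of all smaller ones.
    return partials.back() > 0.0 ? 1 : -1;
}
```

To compare two sets, feed the values of one set together with the negated values of the other and test the sign. The naive sum of {1e100, 1.0, -1e100} is 0, whereas this sketch reports a positive sign. Handling intermediate overflow, which the question explicitly cares about, is what the superaccumulator designs in the references are built for.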

    Some relevant papers and code:

    • Y. K. Zhu and W. B. Hayes. "Algorithm 908: Online Exact Summation of Floating-Point Streams". ACM Transactions on Mathematical Software (ACM TOMS), 37(3):37:1-37:13, September 2010. doi: 10.1145/1824801.1824815

      • Unfortunately the paper and code are behind a paywall, but this appears to be the C++ code.
    • R. M. Neal, "Fast Exact Summation using Small and Large Superaccumulators". 2015. arXiv: 1505.05571

    • M. T. Goodrich, A. Eldawy "Parallel Algorithms for Summing Floating-Point Numbers". 2016. arXiv: 1605.05436
