如何总和大量的浮点数? [英] how to sum a large number of float number?

查看:168
本文介绍了如何总和大量的浮点数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我建立一个并行和代码来总和大量的浮点数,然后我发现当数字大于100000000,结果会出错。然后我构建一个串行代码进行比较。序列号也得到错误的数字。任何人都知道为什么?谢谢!



我的简单代码如下。



结果是1.67772e + 007。应为1e + 008

  int main()
{
size_t N = 100000000;
cout<<n is:<<<<<< endl;
clock_t start = clock();
task_scheduler_init init;
vector< float> myvec;
vector< float> * pvec;
for(int i = 0; i myvec.push_back(1.0f);
pvec =& myvec;
float mysum;
mysum = parallelSum(pvec);
cout<<p sum is:<< mysum<< endl;
clock_t finish = clock();
cout<<Time Used =<<(finish - start)/ CLOCKS_PER_SEC<< endl;
mysum = 0;
for(int i = 0; i mysum + = myvec [i];
cout<<s sum是:<< mysum<< endl;
return 0;
}


解决方案

您的问题是由于

$

  1.0f + 1.0f == 2.0f,

您会发现

  16777216.0f + 1.0f == 16777216.0f 

额外的1.0f只是因为16777217无法用 float 格式来表示。



你的结果 - 1.67772e + 007 - 很清楚,这是发生了什么。



你的预期结果100000000.0是相当多(6x)大于16777216.0f ,但是一旦总和达到总共16777216.0f,它保持在其余的8327884添加1.0f。



解决方案:尝试使用 double 而不是 float ,在找到此问题之前,它可以达到 9007199254740992.0

为什么?



在单精度浮点中只有24位精度可用,2 ^ 24是16777216。是没有办法编码16777217在24位,所以它只是保持在16777216,在推论是足够近真正的答案。真正的问题出现在你把一个很小的数字添加到一个大数字,其中小数字的总和相对于大数字是显着的,但单独它们不是。


请注意,16777216.0f不是
float 中可以表示为
的最大数字,
的精度限制。例如,
16777216.0fx 2 ^ 4 + 2 ^ 4 => 16777216.0fx 2 ^ 4


double 有53位的精度,所以可以编码高达2 ^ 53或 9007199254740992.0 c> 1.0d 失败。






此问题还表示并行浮点运算 - 浮点加法不是关联的,换句话说,您的顺序算法:

  )=(...((((A1 + A2)+ A3)+ A4)... A10000)


$ b b

可能产生与并行版本不同的结果:

  Sum(A)=(... (A1 + A2)+ A3)+ A4)... A1000)
+(...(((A1001 + A1002)+ A1003)+ A1004)... A2000)
+ ...(((A9001 + A9002)+ A9003)+ ...((((A0011 + A2002)+ A2003)+ A2004)... A3000)
...
+ A9004)... A10000)

因为任何给定的步骤可能在不同程度上失去精度。 p>

这并不意味着更正确,但可能会得到意想不到的结果。






如果您真的必须使用 float ,请尝试以下操作:




  • 将数字从最负值到最正值排序(order(N log N))

  • 依次添加每一对 - B1:= A1 + B2,B2:= A1 + A2,B3:= A3 + A4
    这将产生一个新列表B,长度为初始长度的一半

  • 在B上重复此过程



p>请注意,这会将算法复杂度从O(N)操作更改为O(N log N)操作,但更有可能生成正确的数字。它是相当可并行的。如果你很聪明,你可以合并排序和求和操作。


I build a paralle sum code to sum a large number of float numbers then I found when the number of figures is bigger than 100000000, the result will go wrong. Then I build a serial code to compare. The serial code also get wrong number. Anybody knows why? Thanks!

my simple code is as follows.

the result is " 1.67772e+007". It should be 1e+008

int main()
{
    size_t N=100000000;
    cout<<"n is : "<<N<<endl;
    clock_t start = clock();
    task_scheduler_init init;
    vector<float> myvec;
    vector<float>* pvec;
    for(int i=0;i<N;i++)
        myvec.push_back(1.0f);
    pvec=&myvec;
    float mysum;
    mysum=parallelSum(pvec);
    cout<<" the p sum is: "<<mysum<<endl;
    clock_t finish = clock();
        cout<<"Time Used  = "<<(finish - start)/CLOCKS_PER_SEC<<endl;
        mysum=0;
       for(int i=0;i<N;i++)
    mysum+=myvec[i];
        cout<<" the s sum is: "<<mysum<<endl;
    return 0;
}

解决方案

Your problem is due to the limited available precision of floating point numbers.

While

1.0f + 1.0f == 2.0f, 

You will find that

16777216.0f + 1.0f == 16777216.0f

The extra 1.0f is just thrown away since 16777217 cannot be represented in float format.

Looking at your result – 1.67772e+007 – it's clear that this is exactly what has happened.

Your expected result, 100000000.0, is quite a lot (6x) bigger than 16777216.0f, but once the sum reaches a total of 16777216.0f it stays there for the remaining 8327884 additions of 1.0f.

Solution: Try using double instead of float, which goes up to 9007199254740992.0 before hitting this problem.

Why?

In single precision floating point there are only 24 bits of precision available, and 2^24 is 16777216. There is no way to encode 16777217 in 24 bits, so it simply stays at 16777216, on the reasoning that it's close enough to the real answer. The real problem arises when you add lots of very small numbers to a big number, where the sum of the small numbers is signifiant relative to the big one, but individually they are not.

Note that 16777216.0f is not the largest number that can be represented in float, but just represents the limit of precision. For example, 16777216.0f x 2^4 + 2^4 => 16777216.0f x 2^4

double has 53 bits of precision, so can encode up to 2^53, or 9007199254740992.0 before hitting the point where adding 1.0d fails.


This issue also represents another hazard for parallelising floating point operations - floating point addition is not associative, in other words, your sequential algorithm:

Sum(A) = (...((((A1 + A2) + A3) + A4) ... A10000)

May produce a different result from the parallelised version:

Sum(A) = (...((((A1 + A2) + A3) + A4) ... A1000)
       + (...((((A1001 + A1002) + A1003) + A1004) ... A2000)
       + (...((((A2001 + A2002) + A2003) + A2004) ... A3000)
       ...
       + (...((((A9001 + A9002) + A9003) + A9004) ... A10000)

since any given step may lose precision to a different degree.

This does not mean that either is more right, but that you may get unexpected results.


If you really have to use float, try the following:

  • sort your numbers from most negative to most positive (order (N log N))
  • add each pair in turn - B1 := A1 + B2, B2 := A1 + A2, B3 := A3 + A4 this produces a new list B, half the length of the initial one
  • repeat this procedure on B to get C, C to get D, etc
  • stop when there is only one number left.

Note that this changes your algorithmic complexity from an O(N) operation to an O(N log N) operation, but it's rather more likely to produce the correct number. It's quite parallelisable. You may be able to merge the sort and sum operations if you are clever about it.

这篇关于如何总和大量的浮点数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆