Why two float type variables have different values


Question

I have two integer vectors of about 1000 elements each, and I want to check whether the sum of the squared elements is the same for both vectors. So I wrote the following code:

std::vector<int> array1;
std::vector<int> array2;
... // initialize array1 and array2; in the experiment both vectors hold the
    // same elements, but possibly in a different order.
    // For example: array1 = {1001, 2002, 3003, ...}
    //              array2 = {2002, 3003, 1001, ...}
assert(array1.size() == array2.size());
float sum_array1 = 0;
float sum_array2 = 0;
for (std::size_t i = 0; i < array1.size(); i++)
    sum_array1 += array1[i] * array1[i];
for (std::size_t i = 0; i < array2.size(); i++)
    sum_array2 += array2[i] * array2[i];

I expected sum_array1 to equal sum_array2, but in my application they differed: sum_array1 = 1.2868639e+009 while sum_array2 = 1.2868655e+009. Next, I changed the type of sum_array1 and sum_array2 to double, as the following code shows:

double sum_array1 = 0;
double sum_array2 = 0;
for (std::size_t i = 0; i < array1.size(); i++)
    sum_array1 += array1[i] * array1[i];
for (std::size_t i = 0; i < array2.size(); i++)
    sum_array2 += array2[i] * array2[i];

This time sum_array1 equals sum_array2: sum_array1 = sum_array2 = 1286862225.0000000. My question is why this happens. Thanks.

Recommended answer

Floating point values have a finite size, and can therefore only represent real values with a finite precision. This leads to rounding errors when you need more precision than they can store.

In particular, when adding a small number (such as the squares you're summing) to a much larger number (such as your accumulator), the loss of precision can be quite large compared with the small number, giving a significant error. Because each addition rounds separately, the accumulated error depends on the order in which the values are added.
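The order dependence can be reproduced with just three values. The following is a minimal sketch, assuming float is an IEEE 754 single-precision type and that the compiler does not keep intermediate sums in a wider format; the names sum_in_order, large_first, small_first and check_order_dependence are made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// Adds the values left to right into a single-precision accumulator.
float sum_in_order(const std::vector<float>& values) {
    float acc = 0.0f;
    for (float v : values) {
        acc += v;  // every addition rounds the result to the nearest float
    }
    return acc;
}

// 16777216 is 2^24. Above that value, neighbouring floats are 2 apart,
// so adding 1 to it cannot change the accumulator.
const std::vector<float> large_first = {16777216.0f, 1.0f, 1.0f};
const std::vector<float> small_first = {1.0f, 1.0f, 16777216.0f};

// Same elements, different order, different sums (under the assumptions above).
void check_order_dependence() {
    assert(sum_in_order(large_first) == 16777216.0f);  // both 1s are lost
    assert(sum_in_order(small_first) == 16777218.0f);  // 1 + 1 = 2 survives
}
```

With the large value first, each 1 is rounded away individually; with the small values first, they combine into 2, which is large enough to register.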

Typically, float has 24 bits of precision, corresponding to about 7 significant decimal digits. Your accumulator requires 10 decimal digits (around 31 bits), so you will experience this loss of precision. Typically, double has 53 bits (about 16 decimal digits), so your result can be represented exactly.
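These widths can be checked against the sum from the question. This is a small sketch, assuming IEEE 754 float and double with round-to-nearest conversions (typical, but not guaranteed by the C++ standard); stored_as_float, stored_as_double and check_precision are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns the value a float actually stores when given this integer.
double stored_as_float(std::int64_t value) {
    return static_cast<double>(static_cast<float>(value));
}

// Returns the value a double actually stores when given this integer.
double stored_as_double(std::int64_t value) {
    return static_cast<double>(value);
}

void check_precision() {
    // Significand widths on a typical IEEE 754 platform (an assumption).
    assert(std::numeric_limits<float>::digits == 24);
    assert(std::numeric_limits<double>::digits == 53);

    // Between 2^30 and 2^31 adjacent floats are 128 apart, so the exact
    // sum 1286862225 is rounded to the nearest multiple of 128.
    assert(stored_as_float(1286862225) == 1286862208.0);

    // A double has enough bits to hold the sum exactly.
    assert(stored_as_double(1286862225) == 1286862225.0);
}
```

So even the final result cannot be stored in a float; the best it can do is a value 17 away from the true sum.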

A 64-bit integer may be the best option here, since all the inputs are integers. Using an integer avoids loss of precision, but introduces a danger of overflow if the inputs are too many or too large.
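A sketch of the integer approach might look like the following. The function names sum_of_squares and same_sum_of_squares are illustrative, not from the question, and the code assumes the total stays within the range of std::int64_t.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sums the squares exactly in a 64-bit integer accumulator.
std::int64_t sum_of_squares(const std::vector<int>& values) {
    std::int64_t sum = 0;
    for (int v : values) {
        const std::int64_t wide = v;  // widen first so v * v cannot overflow int
        sum += wide * wide;
    }
    return sum;
}

// Compares two equally sized vectors by their exact sums of squares.
bool same_sum_of_squares(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    return sum_of_squares(a) == sum_of_squares(b);
}
```

Integer addition is exact and associative as long as nothing overflows, so the result no longer depends on the order of the elements.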

To minimise the error if you can't use a wide enough accumulator, you could sort the input so that the smallest values are accumulated first; or you could use more complicated methods such as Kahan summation.
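For reference, Kahan (compensated) summation can be sketched as below. This is an illustration rather than a drop-in fix: kahan_sum and check_kahan are hypothetical names, and the code assumes IEEE 754 float arithmetic compiled without options such as -ffast-math, which allow the compiler to simplify the compensation term away.

```cpp
#include <cassert>
#include <vector>

// Kahan summation: carries the rounding error of each addition forward
// in a separate compensation term.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float compensation = 0.0f;  // running estimate of the error in sum
    for (float v : values) {
        const float y = v - compensation;  // apply the correction to the input
        const float t = sum + y;           // low-order bits of y may be lost here
        compensation = (t - sum) - y;      // recover what was actually lost
        sum = t;
    }
    return sum;
}

void check_kahan() {
    // A plain float loop over these values returns 16777216, because each 1
    // is rounded away; the compensation term keeps track of the lost 1s.
    const std::vector<float> values = {16777216.0f, 1.0f, 1.0f};
    assert(kahan_sum(values) == 16777218.0f);
}
```

The compensation reduces the accumulated error but does not remove the limit on what a float can represent, so a final sum that needs more than 24 bits is still rounded.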
