Double precision error when converting to scientific notation
I'm building a program to convert double values into scientific notation (mantissa, exponent). Then I noticed the following:
369.7900000000000 -> 3.6978999999999997428
68600000 -> 6.8599999999999994316
I noticed the same pattern for several other values as well. The maximum fractional error is
0.000 000 000 000 001 = 1e-15
I know that double values are represented inexactly in a computer. Can I conclude that the maximum fractional error we would get is 1e-15? What is significant about this number?
I went through most of the questions on floating-point precision problems on Stack Overflow, but I didn't see any about the maximum fractional error in 64 bits.
To be clear about the computation I do, here is my code snippet:
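double norm = 68600000;
int exp = 0;    /* decimal exponent; must start at zero */
if (norm != 0.0)    /* assumes norm is positive */
{
    /* scale down until the mantissa is below 10 */
    while (norm >= 10.0)
    {
        norm /= 10.0;
        exp++;
    }
    /* scale up until the mantissa is at least 1 */
    while (norm < 1.0)
    {
        norm *= 10.0;
        exp--;
    }
}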
Now I get
norm = 6.8599999999999994316;
exp = 7
The number you are getting is related to the machine epsilon for the double data type.
A double is 64 bits long, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa fraction. A double's value is given by
1.mmmmm... * (2^exp)
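To make the layout concrete, here is a minimal sketch in C that pulls the three fields out of a double. It assumes the platform stores double as IEEE 754 binary64 and that the value is a normal number (not zero, subnormal, infinity, or NaN). The names double_fields and decode_double are hypothetical, chosen only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a double. */
typedef struct {
    unsigned sign;      /* 1 bit: 0 for positive, 1 for negative */
    int exponent;       /* 11-bit field with the bias of 1023 removed */
    uint64_t fraction;  /* 52 bits that follow the implicit leading 1 */
} double_fields;

/* Assumes IEEE 754 binary64 and a normal (nonzero, finite) value. */
static double_fields decode_double(double x)
{
    uint64_t bits;
    double_fields f;

    /* memcpy copies the raw bit pattern without aliasing problems */
    memcpy(&bits, &x, sizeof bits);

    f.sign = (unsigned)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    f.fraction = bits & (((uint64_t)1 << 52) - 1);
    return f;
}
```

For example, decode_double(-6.0) yields sign 1, exponent 2, and a fraction with only the top bit set, because 6 = 1.1 (binary) * 2^2.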
With only 52 bits for the mantissa fraction, the smallest step that can be represented next to 1.0 is 2^-52. In binary, 1.0 + 2^-52 would be
1.000...00 + 0.000...01 = 1.000...01
Anything at or below half of that step, 2^-53, is rounded away under the default round-to-nearest mode and does not change the value of 1.0. You can verify for yourself that 1.0 + 2^-53 == 1.0 in a program.
This number, 2^-52 = 2.22e-16, is called the machine epsilon. It is an upper bound on the relative error introduced by round-off in a single floating-point arithmetic operation on double values.
Similarly, float has 23 bits in its mantissa and so its machine epsilon is 2^-23 = 1.19e-7.
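The claims about 1.0 can be checked with a short sketch like the one below. It relies on the standard constants DBL_EPSILON and FLT_EPSILON from float.h and assumes the default round-to-nearest mode. The helper names absorbed_by_one and absorbed_by_one_float are hypothetical.

```c
#include <float.h>

/* Returns 1 if adding delta to 1.0 leaves the value unchanged. */
static int absorbed_by_one(double delta)
{
    /* volatile forces the sum to be stored as a real double, so any
       extra precision held in registers cannot hide the rounding */
    volatile double sum = 1.0 + delta;
    return sum == 1.0;
}

/* Same check for single precision. */
static int absorbed_by_one_float(float delta)
{
    volatile float sum = 1.0f + delta;
    return sum == 1.0f;
}
```

Under these assumptions, absorbed_by_one(DBL_EPSILON) returns 0 because 1.0 + 2^-52 is representable, while absorbed_by_one(DBL_EPSILON / 2) returns 1 because 1.0 + 2^-53 rounds back to 1.0.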
The reason you are getting 1e-15 may be that errors accumulate as you perform many arithmetic operations, but I can't say for sure because I don't know the exact calculations you are doing.
EDIT: I've looked into the relative error for your problem with 68600000.
First off, you may be interested to know that round-off error can change the result of your computation if you break it into steps:
686.0/10.0 = 68.59999999999999431566
686.0/10.0/10.0 = 6.85999999999999943157
686.0/100.0 = 6.86000000000000031974
In the first line, the closest double to 68.6 is lower than the actual value, but in the third line we see the closest double to 6.86 is greater.
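One way to avoid stacking up a rounding step per iteration is to build the power of ten first and then scale the original value once. The sketch below contrasts the two approaches. It is an illustration rather than a drop-in replacement: normalize_repeated and normalize_once are hypothetical names, the input is assumed to be positive and finite, and the single-rounding property only holds while the power of ten is itself exact (up to 1e22).

```c
/* Repeated scaling, as in the question: every /= or *= rounds again. */
static double normalize_repeated(double x, int *exp10)
{
    int e = 0;
    while (x >= 10.0) { x /= 10.0; e++; }
    while (x < 1.0)   { x *= 10.0; e--; }
    *exp10 = e;
    return x;
}

/* Single scaling: accumulate the power of ten, then divide or multiply
   the original value once. Assumes x is positive and finite. Values
   extremely close to a power of ten may need an extra correction step. */
static double normalize_once(double x, int *exp10)
{
    double scale = 1.0;   /* powers of ten up to 1e22 are exact doubles */
    int e = 0;

    if (x >= 10.0) {
        while (x / scale >= 10.0) { scale *= 10.0; e++; }
        x /= scale;       /* the only rounding of the result */
    } else {
        while (x * scale < 1.0) { scale *= 10.0; e--; }
        x *= scale;       /* the only rounding of the result */
    }
    *exp10 = e;
    return x;
}
```

With 68600000 as input, both functions report an exponent of 7, but normalize_once returns the double nearest to 6.86 (the value in the third line above), while normalize_repeated returns the slightly lower value from the second line.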
If we look at the absolute error e_abs = abs(v - v_approx) of your program, we see that it is
6.86 - 6.85999999999999943156581139192 ~= 5.684e-16
However, the relative error e_rel = abs((v - v_approx) / v) = e_abs / abs(v) would be
5.684e-16 / 6.86 ~= 8.286e-17
which is indeed below our machine epsilon of 2.22e-16.
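The same check can be scripted. The sketch below uses long double as a higher-precision reference, which is an assumption: on platforms where long double has the same width as double, the reference 6.86L is itself rounded to double precision and the reported error is only approximate. The helper names relative_error and within_double_epsilon are hypothetical.

```c
#include <float.h>

/* Relative error of approx against a higher-precision reference value.
   The reference must be nonzero. */
static long double relative_error(long double reference, double approx)
{
    long double abs_err = reference - (long double)approx;
    if (abs_err < 0.0L) abs_err = -abs_err;
    if (reference < 0.0L) reference = -reference;
    return abs_err / reference;
}

/* Returns 1 if the relative error does not exceed DBL_EPSILON. */
static int within_double_epsilon(long double reference, double approx)
{
    return relative_error(reference, approx) <= (long double)DBL_EPSILON;
}
```

Calling within_double_epsilon(6.86L, 686.0 / 10.0 / 10.0) returns 1, in line with the hand calculation above.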
This is a famous paper you can read if you want to know all the details about floating point arithmetic.