Double precision error when converting to scientific notation

Views: 400 | Posted: 2016/10/29 20:16:59 | Tags: c++ unix precision double-precision scientific-notation

Question

I'm building a program to convert double values into scientific notation (mantissa, exponent). While testing it I noticed the following:

369.7900000000000 -> 3.6978999999999997428

68600000 -> 6.8599999999999994316

I noticed the same pattern for several other values as well. The maximum fractional error is

0.000 000 000 000 001 = 1e-15

I know that double values cannot be represented exactly in a computer. Can I conclude that the maximum fractional error we would get is 1e-15? What is significant about this value?

I went through most of the questions on floating-point precision on Stack Overflow, but I didn't see any about the maximum fractional error in 64 bits.

To make my computation clear, here is my code snippet:

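double norm = 68600000;  // assumed non-negative; a negative value never leaves the second loop
int exp = 0;             // decimal exponent
if (norm != 0.0)
{
    // Scale down until the mantissa is below 10; each division rounds again.
    while (norm >= 10.0)
    {
      norm /= 10.0;
      exp++;
    }
    // Scale up until the mantissa is at least 1.
    while (norm < 1.0)
    {
      norm *= 10.0;
      exp--;
    }
}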

Now I get

norm = 6.8599999999999994316;
exp = 7

Answer

The number you are getting is related to the machine epsilon for the double data type.

A double is 64 bits long, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa fraction. A double's value is given by

1.mmmmm... * (2^exp)
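The layout described above can be inspected directly. The following is a minimal sketch, assuming the platform's double is a 64-bit IEEE 754 value; `DoubleFields` and `decode_double` are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper type for this sketch: the three raw fields of a double.
struct DoubleFields {
    std::uint64_t sign;      // 1 bit
    std::uint64_t exponent;  // 11 bits, stored with a bias of 1023
    std::uint64_t fraction;  // 52 bits, the "mmmmm..." after the implicit leading 1
};

// Assumption: double is IEEE 754 binary64.
static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch assumes a 64-bit IEEE 754 double");

DoubleFields decode_double(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bit pattern
    DoubleFields fields;
    fields.sign = bits >> 63;
    fields.exponent = (bits >> 52) & 0x7FF;
    fields.fraction = bits & ((std::uint64_t{1} << 52) - 1);
    return fields;
}
```

For example, `decode_double(1.0)` yields sign 0, a stored exponent of 1023 (that is, 2^0 after removing the bias) and a fraction of 0.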

With only 52 bits for the mantissa fraction, a small enough value is lost entirely when added to 1.0. In binary, 1.0 + 2^-52 would be

1.000...00 + 0.000...01 = 1.000...01

so 2^-52 still changes the last bit. In the default round-to-nearest mode, any positive value of 2^-53 or less does not change 1.0 at all. You can verify for yourself that 1.0 + 2^-53 == 1.0 in a program.

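This number, 2^-52 = 2.22e-16, is called the machine epsilon. It is an upper bound on the relative round-off error of a single floating-point arithmetic operation on double values (with round-to-nearest and a result in the normal range, the error is in fact at most half of it, 2^-53 = 1.11e-16).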

Similarly, float has 23 bits in its mantissa and so its machine epsilon is 2^-23 = 1.19e-7.
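These values can be checked in a few lines. This is a small sketch rather than a definitive test; `absorbed_by_one` and the constant names are hypothetical, while `std::numeric_limits<T>::epsilon()` and `std::ldexp` are standard library facilities.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true when adding x to 1.0 gives back exactly 1.0.
bool absorbed_by_one(double x) {
    // volatile forces the sum to be rounded to double before the comparison,
    // even on targets that evaluate expressions in wider registers.
    volatile double sum = 1.0 + x;
    return sum == 1.0;
}

// 2^-52 and 2^-53, built exactly with ldexp (which computes x * 2^n).
const double two_pow_minus_52 = std::ldexp(1.0, -52);
const double two_pow_minus_53 = std::ldexp(1.0, -53);

// Machine epsilon as reported by the standard library.
const double double_eps = std::numeric_limits<double>::epsilon();
const float float_eps = std::numeric_limits<float>::epsilon();
```

On an IEEE 754 platform, `double_eps` equals `two_pow_minus_52`, `absorbed_by_one(two_pow_minus_53)` is true, and `absorbed_by_one(two_pow_minus_52)` is false.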

The reason you are getting 1e-15 may be because errors accumulate as you perform many arithmetic operations, but I can't say because I don't know the exact calculations you are doing.

EDIT: I've looked into the relative error for your problem with 68600000.

First off, you may be interested to know that round-off error can change the result of your computation if you break it into steps:

686.0/10.0      = 68.59999999999999431566
686.0/10.0/10.0 = 6.85999999999999943157
686.0/100.0     = 6.86000000000000031974

In the first line, the closest double to 68.6 is lower than the actual value, but in the third line we see the closest double to 6.86 is greater.
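One way to avoid rounding at every step is to work out the power of ten first and scale the original value only once. The sketch below is not the asker's code; it is an illustrative alternative, and `Scientific` and `normalize_once` are hypothetical names. It assumes a finite input, and the power of ten is exact in a double only up to 10^22.

```cpp
#include <cmath>

// Hypothetical result type for this sketch.
struct Scientific {
    double mantissa;  // magnitude in [1, 10) for nonzero finite input
    int exponent;     // power of ten
};

// Find the power of ten first, then scale the ORIGINAL value once, so the
// mantissa goes through a single rounding instead of one rounding per step.
// Assumption: finite input; 10^k is exactly representable only for k <= 22.
Scientific normalize_once(double value) {
    Scientific result = {0.0, 0};
    if (value == 0.0) return result;

    const double magnitude = std::fabs(value);
    double power = 1.0;  // 10^|exponent|
    int exponent = 0;

    if (magnitude >= 10.0) {
        while (magnitude / power >= 10.0) { power *= 10.0; ++exponent; }
        result.mantissa = magnitude / power;
    } else {
        while (magnitude * power < 1.0) { power *= 10.0; --exponent; }
        result.mantissa = magnitude * power;
    }
    result.exponent = exponent;
    if (value < 0.0) result.mantissa = -result.mantissa;
    return result;
}
```

For 68600000 this returns exponent 7 and a mantissa equal to the literal 6.86 (the closest double to 6.86), whereas 686.0 / 10.0 / 10.0 lands on the next double below it.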

If we look at the absolute error e_abs = abs(v - v_approx) of your program, we see that it is

6.8600000 - 6.85999999999999943156581139192 ~= 5.684e-16

However, the relative error e_rel = abs((v - v_approx) / v) = e_abs / abs(v) would be

5.684e-16 / 6.86  ~=  8.286e-17

which is indeed below our machine epsilon of 2.22e-16.
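The relative error can also be computed in code, with one caveat: the literal 6.86 is itself rounded, so subtracting it from the approximation would measure against the wrong reference. The sketch below sidesteps this by writing the true value as 686 / 100 and using `std::fma`; `relative_error_vs_decimal` and the constant names are hypothetical, introduced for illustration only.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: relative error of `approx` against the exact decimal
// value numerator / scale (for example 686 / 100 for 6.86).
// Assumption: numerator and scale are exactly representable, numerator != 0.
double relative_error_vs_decimal(double approx, double numerator, double scale) {
    // (approx - numerator/scale) / (numerator/scale)
    //   == (approx * scale - numerator) / numerator
    // std::fma evaluates approx * scale - numerator with a single rounding,
    // so the tiny difference is not lost to cancellation.
    const double scaled_difference = std::fma(approx, scale, -numerator);
    return std::fabs(scaled_difference / numerator);
}

// The value produced by dividing step by step.
const double stepwise = 686.0 / 10.0 / 10.0;
const double relative_error = relative_error_vs_decimal(stepwise, 686.0, 100.0);
const bool below_epsilon = relative_error < std::numeric_limits<double>::epsilon();
```

With IEEE 754 doubles, `relative_error` comes out at roughly 8.286e-17 and `below_epsilon` is true.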

This is a famous paper you can read if you want to know all the details about floating point arithmetic.
