How are these double precision values accurate to 20 decimals?

Views: 201 · Posted: 2020/6/12 19:23:43 · Tags: fortran, precision


Problem description

I am testing some very simple equivalence errors when precision is an issue and was hoping to perform the operations in extended double precision (so that I knew what the answer would be in ~19 digits) and then perform the same operations in double precision (where there would be roundoff error in the 16th digit), but somehow my double precision arithmetic is maintaining 19 digits of accuracy.

When I perform the operations in extended double, then hardcode the numbers into another Fortran routine, I get the expected errors, but is there something strange going on when I assign an extended double precision variable to a double precision variable here?

program code_gen
    implicit none 
    integer, parameter :: Edp = selected_real_kind(17)
    integer, parameter :: dp = selected_real_kind(8)
    real(kind=Edp) :: alpha10, x10, y10, z10 
    real(kind=dp) :: alpha8, x8, y8, z8

    real(kind = dp) :: pi_dp = 3.1415926535897932384626433832795028841971693993751058209749445

    integer :: iter
    integer :: niters = 10

    print*, 'tiny(x10) = ', tiny(x10)
    print*, 'tiny(x8)  = ', tiny(x8)
    print*, 'epsilon(x10) = ', epsilon(x10)
    print*, 'epsilon(x8)  = ', epsilon(x8)

    do iter = 1,niters
        x10 = rand()
        y10 = rand()
        z10 = rand()
        alpha10 = x10*(y10+z10)

        x8 = x10 
        x8 = x8 - pi_dp
        x8 = x8 + pi_dp
        y8 = y10 
        y8 = y8 - pi_dp
        y8 = y8 + pi_dp
        z8 = z10 
        z8 = z8 - pi_dp
        z8 = z8 + pi_dp
        alpha8 = alpha10

        write(*, '(a, es30.20)') 'alpha8 .... ', x8*(y8+z8)
        write(*, '(a, es30.20)') 'alpha10 ... ', alpha10

        if( alpha8 .gt. x8*(y8+z8) ) then
            write(*, '(a)') 'ERROR(.gt.)'
        elseif( alpha8 .lt. x8*(y8+z8) ) then
            write(*, '(a)') 'ERROR(.lt.)'
        endif
    enddo
end program code_gen

where rand() is the gfortran function described in the gfortran documentation (linked in the original post).

If we are speaking about only one precision type (take, for example, double), then we can denote machine epsilon as E16 which is approximately 2.22E-16. If we take a simple addition of two Real numbers, x+y, then the resulting machine expressed number is (x+y)*(1+d1) where abs(d1) < E16. Likewise, if we then multiply that number by z, the resulting value is really (z*((x+y)*(1+d1))*(1+d2)) which is nearly (z*(x+y)*(1+d1+d2)) where abs(d1+d2) < 2*E16. If we now move to extended double precision, then the only thing that changes is that E16 turns to E20 and has a value of around 1.08E-19.
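The error model in the paragraph above can be written out as follows. This is only a restatement of that paragraph; the notation fl(.) for "the machine result of" and the symbol epsilon for machine epsilon (E16 for double, E20 for extended double) are introduced here for convenience and are not from the original post.

```latex
\begin{align*}
\mathrm{fl}(x+y) &= (x+y)(1+\delta_1), & |\delta_1| &< \varepsilon,\\
\mathrm{fl}\bigl(z\cdot\mathrm{fl}(x+y)\bigr) &= z\,(x+y)(1+\delta_1)(1+\delta_2), & |\delta_2| &< \varepsilon,\\
&= z\,(x+y)(1+\delta_1+\delta_2+\delta_1\delta_2)\\
&\approx z\,(x+y)(1+\delta_1+\delta_2), & |\delta_1+\delta_2| &< 2\varepsilon.
\end{align*}
```

The approximation in the last step drops the second-order term delta_1*delta_2, which is of order epsilon squared. With epsilon = E16 (about 2.22E-16) this gives the double precision bound, and with epsilon = E20 (about 1.08E-19) the extended one.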

My hope was to perform the analysis in extended double precision so that I could compare two numbers which should be equal but show that, on occasion, roundoff error will cause comparisons to fail. By assigning x8=x10, I was hoping to create a double precision 'version' of the extended double precision value x10, where only the first ~16 digits of x8 conform to the values of x10, but upon printing out the values, it shows that all 20 digits are the same and the expected double precision roundoff error is not occurring as I would expect.

It should also be noted that before this attempt, I wrote a program which actually writes another program where the values of x, y, and z are 'hardcoded' to 20 decimal places. In this version of the program, the comparisons of .gt. and .lt. failed as expected, but I am not able to duplicate the same failures by casting an extended double precision value as a double precision variable.

In an attempt to further 'perturb' the double precision values and add roundoff error, I have added, then subtracted, pi from my double precision variables which should leave the remaining variables with some double precision roundoff error, but I am still not seeing that in the final result.

Solution

As the gfortran documentation you link states, the function result of rand is a default real value (single precision). Such a value can be represented exactly by each of your other real types.

That is, x10=rand() assigns a single precision value to the extended precision variable x10. It does so exactly. This same value now stored in x10 is assigned to the double precision variable x8, but this remains exactly representable as double precision.

There is sufficient precision in the single-as-double that the calculations using double and extended types return the same value. [See the note at the end of this answer.]
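A minimal sketch of this point, using fixed default real values in place of rand() so that the result is reproducible. The values 0.1, 0.2 and 0.3 and the program name are arbitrary choices for illustration, and the sketch assumes the compiler provides a kind for selected_real_kind(17), as gfortran on x86 does.

```fortran
program single_is_exact
    implicit none
    ! Same kind selections as in the question. That selected_real_kind(17)
    ! yields an extended (or wider) type is an assumption about the compiler.
    integer, parameter :: Edp = selected_real_kind(17)
    integer, parameter :: dp  = selected_real_kind(8)
    real :: xs, ys, zs                  ! default real, like the result of rand()
    real(kind=Edp) :: x10, y10, z10, alpha10
    real(kind=dp)  :: x8, y8, z8, alpha8

    xs = 0.1; ys = 0.2; zs = 0.3        ! arbitrary single precision values
    x10 = xs; y10 = ys; z10 = zs        ! widening conversions are exact
    x8 = x10; y8 = y10; z8 = z10        ! still exact: only 24 significant bits are set

    alpha10 = x10*(y10 + z10)
    alpha8  = x8*(y8 + z8)

    write(*, '(a, es30.20)') 'alpha8 .... ', alpha8
    write(*, '(a, es30.20)') 'alpha10 ... ', alpha10
    if (real(alpha8, Edp) == alpha10) then
        write(*, '(a)') 'identical'
    else
        write(*, '(a)') 'different'
    end if
end program single_is_exact
```

For these inputs the sum needs at most 26 significant bits and the product at most 50, so both fit in the 53-bit significand of an IEEE double and the two results agree in every printed digit.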

If you wish to see real effects of loss of precision, then start by using an extended or double precision value. For example, rather than using rand (returning a single precision value), use the intrinsic random_number

call random_number(x10)

(which has the advantage of being standard Fortran). Unlike a function, which in (nearly) all cases returns a value of a fixed type regardless of how that value is eventually used, this subroutine will give you a precision corresponding to its argument. You should (hopefully) then see much the same behaviour as in your "hard-coded" experiment.

Alternatively, as agentp commented, it may be more intuitive to start with a double precision value

call random_number(x8); x10=x8   ! x8 and x10 have the precision of double precision
call random_number(y8); y10=y8
call random_number(z8); z10=z8

and perform the calculations from that starting point: those extra bits will then start to show.
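As a reproducible sketch of that starting point, the program below uses fixed double precision values instead of random_number (an arbitrary substitution made here so the outcome does not vary between runs) and compares a single addition in the two kinds. The variable names s8 and s10 are hypothetical, and the same assumption about selected_real_kind(17) applies.

```fortran
program double_start
    implicit none
    integer, parameter :: Edp = selected_real_kind(17)
    integer, parameter :: dp  = selected_real_kind(8)
    real(kind=dp)  :: y8, z8, s8
    real(kind=Edp) :: y10, z10, s10

    ! Fixed values stand in for "call random_number(y8); y10 = y8" etc.
    y8 = 0.1_dp; y10 = y8
    z8 = 0.2_dp; z10 = z8

    s8  = y8 + z8       ! result rounded to the double precision significand
    s10 = y10 + z10     ! same operands, wider significand

    write(*, '(a, es30.20)') 's8 .... ', s8
    write(*, '(a, es30.20)') 's10 ... ', s10
    if (real(s8, Edp) /= s10) write(*, '(a)') 'the sums differ'
end program double_start
```

Here the operands already use all 53 bits of a double, the exact sum needs 55 bits, and so the double precision result has to be rounded while the extended one does not.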

In summary, when you do x8=x10 you are getting the first few bits of x8 corresponding to those of x10, but many of those bits and those that follow in x10 are all zero.

When it comes to your pi_dp perturbation, you are again assigning a single precision (this time a literal constant) value to a double precision variable. Just having all those digits doesn't make it anything other than a default real literal. You can specify a different kind of literal with a _Edp suffix, as described in other answers.
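The effect of the missing kind suffix can be seen with a small sketch like the following. The names pi_single and pi_double are hypothetical, and the literal is shortened relative to the one in the question.

```fortran
program literal_kinds
    implicit none
    integer, parameter :: dp = selected_real_kind(8)
    ! No suffix: the literal is default real, so it is rounded to single
    ! precision before being converted to kind dp.
    real(kind=dp), parameter :: pi_single = 3.14159265358979323846
    ! With the _dp suffix the literal itself has kind dp.
    real(kind=dp), parameter :: pi_double = 3.14159265358979323846_dp

    write(*, '(a, es30.20)') 'no suffix .. ', pi_single
    write(*, '(a, es30.20)') '_dp suffix . ', pi_double
    write(*, '(a, es12.4)')  'difference . ', pi_single - pi_double
end program literal_kinds
```

The two constants differ from about the eighth significant digit onward, which is the precision of a default real.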

Finally, one also has to worry about what the compiler does with regard to optimization.

My thesis is that, starting from single precision values, the calculations performed are exactly representable in both double and extended precision (with the same values). That needn't be the case for other calculations, for starting values with more bits set, or for other representations (for example, on some systems or with other compilers the real type with kind selected_real_kind(17) may have quite different characteristics, such as a different radix).

This was largely based on guessing, in the hope that it explained the observation. Fortunately, there are ways to test the idea. As we're talking about IEEE arithmetic we can consider the inexact flag. If that flag isn't raised during the computation we can be happy.

With gfortran there is the compilation option -ffpe-trap=inexact, which makes the inexact exception trap (signal) when it is raised. From gfortran 5.0 the intrinsic module ieee_exceptions is supported, and it can be used in a portable/standard manner.

You can consider this flag for further experimentation: if it is raised then you can expect to see differences between the two precisions.
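A possible way to run that experiment with the standard module is sketched below. It assumes the processor supports the ieee_inexact flag; the volatile attribute is used only to discourage the compiler from evaluating the sums at compile time, and the variable names are hypothetical. The flag is read immediately after each addition, before any output is performed.

```fortran
program inexact_flag_demo
    use, intrinsic :: ieee_exceptions, only: ieee_inexact, ieee_get_flag, ieee_set_flag
    implicit none
    integer, parameter :: dp = selected_real_kind(8)
    real, volatile          :: xs, ys        ! default real operands
    real(kind=dp), volatile :: a, b, s
    logical :: from_single, from_double

    ! Operands that hold single precision values.
    xs = 0.1; ys = 0.2
    a = xs; b = ys
    call ieee_set_flag(ieee_inexact, .false.)   ! clear the flag
    s = a + b
    call ieee_get_flag(ieee_inexact, from_single)

    ! Operands that use the full double precision significand.
    a = 0.1_dp; b = 0.2_dp
    call ieee_set_flag(ieee_inexact, .false.)
    s = a + b
    call ieee_get_flag(ieee_inexact, from_double)

    print *, 'inexact raised, single-valued operands: ', from_single
    print *, 'inexact raised, double-valued operands: ', from_double
end program inexact_flag_demo
```

On an IEEE system without aggressive floating-point optimization, the first addition would be expected to leave the flag clear and the second to raise it, which matches the thesis above.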
