Converting from floating-point to decimal with floating-point computations


Problem description

      I am trying to convert a floating-point double-precision value x to decimal with 12 (correctly rounded) significant digits. I am assuming that x is between 10^110 and 10^111 such that its decimal representation will be of the form x.xxxxxxxxxxxE110. And, just for fun, I am trying to use floating-point arithmetic only.

      I arrived at the pseudo-code below, where all operations are double-precision operations. The notation 1e98 is for the double nearest to the mathematical 10^98, and 1e98_2 is the double nearest to the result of the mathematical subtraction 10^98-1e98. The notation fmadd(X * Y + Z) is for the fused multiply-add operation with operands X, Y, Z.

        y = x * 2^-1074;    // exact
        q = y / 1e98;       // q is denormal and the significand of q interpreted
                            // as an integer is our candidate for the 12 decimal
                            // digits of x
      
        r = fmadd(q * 1e98 - y);  // close to 1e98 * (error made during the division)
      
        // If 1e98_2 >= 0, we divided by a number that was smaller than we wished
        // The correct answer may be q or q+1.
      
        if (r and 1e98_2 have opposite signs)
        {
          return the significand of q;
        }
      
        s = copysign(2^-1074, r);
        r1 = abs(r);
        r2 = abs(1e98_2);
      
        h = 1e98 * 0.5 * 2^-1074;
      
        Set rounding mode to downwards
      
        r3 = fmadd(r2 * q + r1);
      
        if (r3 < h)
        {
          return the significand of q;
        }
        else
        {
          return significand of (q + s)
        }
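
For readers without fmadd hardware, the pseudo-code above can be approximated in Python. This is only a sketch under stated assumptions: the fused multiply-add is emulated with exact rationals from the `fractions` module (one rounding at the end), and the names `candidate_digits`, `significand`, `E98`, `E98_2` and `TINY` are made up for illustration. It deliberately keeps the `q + s` step exactly as written in the question.

```python
import math
from fractions import Fraction

E98 = 1e98                          # double nearest to 10^98
E98_2 = float(10**98 - int(E98))    # double nearest to the exact 10^98 - 1e98
TINY = math.ldexp(1.0, -1074)       # 2^-1074, the smallest positive subnormal


def significand(q):
    """Integer significand of a non-negative subnormal q, i.e. q / 2^-1074."""
    return int(math.ldexp(q, 1074))


def candidate_digits(x):
    """Transcription of the pseudo-code for a double x with 1e110 <= x < 1e111."""
    y = math.ldexp(x, -1074)                     # x * 2^-1074, exact
    q = y / E98                                  # subnormal quotient
    # fmadd(q * 1e98 - y): exact rational value, then a single rounding.
    r = float(Fraction(q) * Fraction(E98) - Fraction(y))

    if math.copysign(1.0, r) != math.copysign(1.0, E98_2):
        return significand(q)

    s = math.copysign(TINY, r)
    h = math.ldexp(E98, -1075)                   # 1e98 * 0.5 * 2^-1074, exact
    # Exact value of fmadd(r2 * q + r1) before any rounding.
    r3_exact = Fraction(abs(E98_2)) * Fraction(q) + Fraction(abs(r))
    # Because h is itself a double, "round r3 downward, then test r3 < h"
    # is equivalent to testing the exact value against h.
    if r3_exact < Fraction(h):
        return significand(q)
    return significand(q + s)                    # as written in the question


print(candidate_digits(1e110))
```

The emulation trades speed for certainty about what an fmadd would compute, which seems a reasonable choice for experimenting with the algorithm before running it on real hardware.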
      

      I apologize for the confusion that pervades the above pseudo-code, but it is not very clear to me yet, hence the following questions:

      1. Does the first fmadd work as intended (to compute 1e98 * (error made during the division))?

      2. The signs. I cannot convince myself that they are right. But I cannot convince myself that they are wrong either.

      3. Any idea, perhaps arguments, about the frequency with which this algorithm might produce the wrong result?

      4. If it works at all, is there any chance that the algorithm will continue to work if "q = y / 1e98" is changed to "q = y * 1e-98" (leaving all other instructions the same)?

      I have not tested this algorithm. I do not have any computer with an fmadd instruction, although I hope to find one so that I can execute the above.

      Solution

      Let y/d be the exact quotient, and q=rnd(y/d) be that quotient rounded to the nearest float.
      Then the true error multiplied by d is rt=(rnd(y/d)-y/d)*d=q*d-y, and the operation performed with fmadd is r=rnd(q*d-y).
      Why q*d-y is exact (that is, why the final rounding of the fmadd changes nothing) is harder to explain briefly. Roughly: q*d has a limited number of bits (<nbits(q)+nbits(d)), the exponent of y is that of q*d (+/- 1), and since the error satisfies |rt|<=0.5*ulp(q)*d, the leading nbits(q) bits cancel in the subtraction. That answers question 1.

      So q*1e98 - y = r, where |r|*2^1074 <= 0.5*1e98 < 5*10^97 (the second inequality is lucky: it holds because the double 1e98 lies below 10^98).

      q*(10^98) - y = r + (10^98-1e98)*q, where |10^98-1e98|*q*2^1074 <= 0.5e95 (assuming at least 15 digits of precision, since log(2^53)/log(10) > 15).

      So you are asking whether |q*(10^98)-y|*2^1074 > 5*10^97.

      You have an approximation of q*(10^98)-y, which is r+1e98_2*q.

      Since |r|*2^1074 < 5*10^97, and |r+(10^98-1e98)*q| < |r| if the signs are opposite, I think that answers question 2 positively. But I wouldn't be so sure if 1e98_2 were < 0.

      If r and 1e98_2 have the same sign, the sum might exceed 5*10^97, hence your further handling, which compares r3 = 1e98_2*q + r with h=0.5e98*2^-1074.

      For question 3, at first sight, I'd say that two things might make the algorithm fail:

      • 1e98_2 is not exact (10^98-1e98-1e98_2 = -3.6e63 approx.)

      • and h is not ht=0.5*10^98*2^-1074 but a bit smaller as we saw above.
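
The two error sources in the list above can be measured exactly with integer arithmetic. A minimal sketch, assuming nothing beyond the definitions in the question; `constant_errors` and its local names are invented for this example.

```python
import math
from fractions import Fraction


def constant_errors():
    """Exact errors in the two constants flagged above."""
    e98 = 1e98
    gap = 10**98 - int(e98)          # exact value of 10^98 - 1e98
    e98_2 = float(gap)               # the double the question calls 1e98_2
    residual = gap - int(e98_2)      # exact 10^98 - 1e98 - 1e98_2
    h = math.ldexp(e98, -1075)       # 1e98 * 0.5 * 2^-1074 as computed in doubles
    ht = Fraction(10**98, 2**1075)   # the true tie 0.5 * 10^98 * 2^-1074
    return gap, e98_2, residual, h, ht


gap, e98_2, residual, h, ht = constant_errors()
print("1e98_2            =", e98_2)
print("residual          =", float(residual))
print("h below true tie? =", Fraction(h) < ht)
```

Printing the residual lets the approximate figure quoted in the first bullet be compared against an exact computation.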

      The true error r3t is approximately (1e98_2-3e63)*q + r < r3 (only the case r > 0 interests us, because 1e98_2>0).

      So if the approximate error r3 falls above the approximate tie h while the true error r3t is below the true tie ht, the rounding could be incorrect. Whether that is possible, and if so how often, is your question 3.

      To mitigate that risk, you tried to truncate the magnitude of r3, so that r3 <= 1e98_2*q + r. I was too tired to perform a true analysis of the error bounds...

      So I scanned for an error, and the first failing example I found was 1.0000000001835e110 (assuming the literal is correctly rounded to the nearest double, whose exact value is in fact 1000000000183.49999984153799821120915424942630528225695526491963291846957919215885146546696544423465444842668032e98).

      In this case, r and 1e98_2 have same sign, and

      • (x/1e98) > 1000000000183.50000215

      • q significand is thus rounded to 1000000000184

      • r3>h (r3*2^1074 is approx. 5.000001584620017e97), so the algorithm incorrectly returns q+s when it should have returned q-s. That is definitely a bug.
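
When hunting for failing inputs like this one, an exact reference is handy. The sketch below is not the scan used in this answer; it is a hypothetical helper, `reference_digits`, that computes the correctly rounded 12 digits with Python's arbitrary-precision integers so that any candidate algorithm can be compared against it.

```python
def reference_digits(x):
    """Correctly rounded 12 significant digits of a double x, 1e110 <= x < 1e111."""
    n = int(x)                 # exact: every double this large is an integer
    scale = 10**98
    k, rem = divmod(n, scale)  # n = k * 10^98 + rem, with 0 <= rem < 10^98
    # Round to nearest; an exact tie (if one can occur at all) goes to even.
    if 2 * rem > scale or (2 * rem == scale and k % 2 == 1):
        k += 1
    return k


print(reference_digits(1.2345678901e110))
print(reference_digits(1.0000000001835e110))
```

Going by the exact expansion quoted above, the second call should land on the digits below the tie rather than on 1000000000184.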

      My answers are:

      1. Yes, r=fmadd(q * 1e98 - y) is exactly 1e98*(error made during the division). But we don't care about the division, it just provides a guess; what counts is that the subtraction is exact.

      2. Yes, the sign is correct, because |r|*2^1074 < 5*10^97, and |r+(10^98-1e98)*q| < |r| if the signs are opposite. But I wouldn't be so sure if 1e98_2 were < 0.

      3. The first failing example lies (1.0000000001835e110 - 1.0e110)/ulp(1.0e110) -> 1.099632e6 ulps above 1.0e110, so a very naive conjecture would be that r3 falls above h in about 1 case out of a million. Once q+s is corrected into q-s, the occurrence of r3>h while r3t<ht is in any case much smaller than 1/1,000,000. There are more than 10^15 doubles in the range of interest, so do not consider this a serious answer.

      4. Yes, the discussion above is solely about the guess q, independently of the way it was produced, and the subtraction in 1. will still be exact.
