float128 and double-double arithmetic

Problem description

I've seen on Wikipedia that one way to implement quad precision is to use double-double arithmetic, even though it does not give exactly the same precision in terms of bits: https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format

In this case, we use two doubles to store the value. So we make two operations to compute the result, one for each double of the result.

In this case, can we have round-off errors on each double, or is there a mechanism that avoids this?

Solution

"In this case, we use two double to store the value. So we need to make two operations at each time."

This is not how double-double arithmetic works. You should expect one double-double operation to take anywhere from 6 to 20 double operations, depending on the actual operation being implemented, the availability of a fused multiply-add (FMA) operation, the assumption that one operand is larger than the other, …
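
To give a feel for the lower end of that range, here is a minimal sketch of the classic error-free addition of two doubles (usually called TwoSum and attributed to Knuth), which costs 6 double operations. It is not taken from CRlibm; the type `dd` and the function `two_sum` are hypothetical names used only for illustration, and the sketch assumes round-to-nearest, no overflow, and a compiler that does not reorder floating-point operations (so no -ffast-math).

```c
/* Hypothetical helper type: an unevaluated sum hi + lo of two doubles. */
typedef struct { double hi, lo; } dd;

/* TwoSum: hi is the rounded sum a + b, lo is the rounding error,
 * so that hi + lo equals a + b exactly (round-to-nearest, no overflow).
 * It uses 6 double operations and makes no assumption on |a| versus |b|. */
static dd two_sum(double a, double b)
{
    dd r;
    double bb;

    r.hi = a + b;                         /* 1 operation  */
    bb   = r.hi - a;                      /* 1 operation  */
    r.lo = (a - (r.hi - bb)) + (b - bb);  /* 4 operations */
    return r;
}
```

For example, two_sum(1.0, 0x1p-60) keeps the tiny addend in lo instead of losing it: hi is 1.0 and lo is 2^-60.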

For instance, here is one implementation of a double-double multiplication for when an FMA instruction is not available, taken from CRlibm:

#define Mul22(zh,zl,xh,xl,yh,yl)                      \
{                                                     \
  double mh, ml;                                      \
                                                      \
  const double c = 134217729.;                        \
  double up, u1, u2, vp, v1, v2;                      \
                                                      \
  up = (xh)*c;        vp = (yh)*c;                    \
  u1 = ((xh)-up)+up;  v1 = ((yh)-vp)+vp;              \
  u2 = (xh)-u1;       v2 = (yh)-v1;                   \
                                                      \
  mh = (xh)*(yh);                                     \
  ml = (((u1*v1-mh)+(u1*v2))+(u2*v1))+(u2*v2);        \
                                                      \
  ml += (xh)*(yl) + (xl)*(yh);                        \
  *zh = mh+ml;                                        \
  *zl = mh - (*zh) + ml;                              \
}

The first 8 operations alone split the high double of each operand (xh and yh) exactly into two halves, so that one half from each side can be multiplied by one half from the other side and the result obtained exactly as a double. The constant c is 2^27 + 1. The computations u1*v1, u1*v2, … do exactly that.
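
The splitting step can be looked at in isolation. The sketch below pulls the first lines of the macro out into a function (this technique is commonly known as Veltkamp splitting); the name `split` and the pointer-based interface are hypothetical, chosen for illustration. It assumes round-to-nearest and that x*c does not overflow.

```c
/* Split x into two halves so that *hi + *lo == x exactly.
 * Each half has few enough significant bits that the product of two
 * halves fits in a double without rounding.
 * Mirrors the up/u1/u2 lines of Mul22; 4 double operations per operand. */
static void split(double x, double *hi, double *lo)
{
    const double c = 134217729.0;  /* 2^27 + 1, as in Mul22 */
    double up = x * c;

    *hi = (x - up) + up;           /* upper half of the significand */
    *lo = x - *hi;                 /* remainder, computed exactly   */
}
```

For instance, splitting 1 + 2^-52 gives hi = 1.0 and lo = 2^-52.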

The values obtained in mh and ml can overlap, so the last 3 operations are there to renormalize the result into the sum of two floating-point numbers.
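
Those last 3 operations follow the pattern usually called Fast2Sum. Here is a sketch of it as a standalone function; the name `fast_two_sum` is hypothetical, and the result is only guaranteed under the assumption |a| >= |b| (plus round-to-nearest), which is an instance of the "one operand is larger than the other" assumption mentioned earlier.

```c
/* Fast2Sum: *s is the rounded sum a + b and *e the rounding error,
 * so that *s + *e == a + b exactly, PROVIDED |a| >= |b|.
 * Same shape as the last lines of Mul22, with a = mh and b = ml. */
static void fast_two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    *e = (a - *s) + b;
}
```

With a = 1.0 and b = 0.75 * 2^-52, the two inputs overlap: the sum rounds up to 1 + 2^-52, and the error term comes out as -2^-54, so the pair (s, e) still represents a + b exactly.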

"In this case, can we have round-off errors on each double, or is there a mechanism that avoids this?"

As the comment says:

/*
 * computes double-double multiplication: zh+zl = (xh+xl) *  (yh+yl)
 * relative error is smaller than 2^-102
 */
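
As a small illustration of what that buys you, the sketch below wraps the macro in a function and can be used to square 1 + 2^-30. The wrapper type `dd` and the function `dd_mul` are hypothetical names, not part of CRlibm, and one input where the result happens to be exact does not prove the 2^-102 bound. It assumes round-to-nearest and no -ffast-math.

```c
/* Mul22 as quoted above from CRlibm (version without FMA). */
#define Mul22(zh,zl,xh,xl,yh,yl)                      \
{                                                     \
  double mh, ml;                                      \
  const double c = 134217729.;                        \
  double up, u1, u2, vp, v1, v2;                      \
  up = (xh)*c;        vp = (yh)*c;                    \
  u1 = ((xh)-up)+up;  v1 = ((yh)-vp)+vp;              \
  u2 = (xh)-u1;       v2 = (yh)-v1;                   \
  mh = (xh)*(yh);                                     \
  ml = (((u1*v1-mh)+(u1*v2))+(u2*v1))+(u2*v2);        \
  ml += (xh)*(yl) + (xl)*(yh);                        \
  *zh = mh+ml;                                        \
  *zl = mh - (*zh) + ml;                              \
}

/* Hypothetical wrapper type: the value represented is hi + lo. */
typedef struct { double hi, lo; } dd;

/* Hypothetical convenience wrapper around the macro. */
static dd dd_mul(dd x, dd y)
{
    dd z;
    Mul22(&z.hi, &z.lo, x.hi, x.lo, y.hi, y.lo);
    return z;
}
```

The exact square of 1 + 2^-30 is 1 + 2^-29 + 2^-60, which a single double cannot hold. Calling dd_mul on the pair (1 + 2^-30, 0) with itself gives hi = 1 + 2^-29 and lo = 2^-60, so the bit that a plain double multiplication would round away is preserved in the low part.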

You can find out about all the mechanisms used to achieve these results in the Handbook of Floating-Point Arithmetic.
