我的 fma() 坏了吗? [英] Is my fma() broken?

查看:22
本文介绍了我的 fma() 坏了吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在使用 double fma(double x, double y, double z); 我希望在下面标有 的输出行中有一个非零 d>'?'.它似乎在内部仅使用 long double 精度,而不是指定的 infinite 精度.

In using double fma(double x, double y, double z); I'd expect a non-zero d in the output lines below marked with '?'. It appears to internally only use long double precision rather than infinite precision as specified.

fma 函数计算 (x × y) + z,四舍五入为一个三元运算:它们根据当前的舍入模式,将值(好像)计算为无限精度并舍入一次到结果格式.§7.12.13.1 2(我的重点)

The fma functions compute (x × y) + z, rounded as one ternary operation: they compute the value (as if) to infinite precision and round once to the result format, according to the current rounding mode. §7.12.13.1 2 (my emphasis)

我的 fma() 是否损坏,或者我如何在代码或编译选项中错误地使用它?

So is my fma() broken, or how am I using it incorrectly in code or compile options?

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
  // Invoking: Cygwin C Compiler
  // gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 
  //   -v -MMD -MP -MF"x.d" -MT"x.o" -o "x.o" "../x.c"

  printf("FLT_EVAL_METHOD %d
", FLT_EVAL_METHOD);
  for (unsigned i = 20; i < 55; i++) {
    volatile double a = 1.0 + 1.0 / pow(2, i);
    volatile double b = a;
    volatile double c = a * b;
    volatile double d = fma(a, b, -c);
    volatile char *nz = ((i >= 27 && a != 1.0) == !d) ? "?" : "";
    printf("i:%2u a:%21.13a c:%21.13a d:%10a %s
", i, a, c, d, nz);
  }
  return 0;
}

输出

FLT_EVAL_METHOD 2
i:20 a: 0x1.0000100000000p+0 c: 0x1.0000200001000p+0 d:    0x0p+0 
i:21 a: 0x1.0000080000000p+0 c: 0x1.0000100000400p+0 d:    0x0p+0 
i:22 a: 0x1.0000040000000p+0 c: 0x1.0000080000100p+0 d:    0x0p+0 
i:23 a: 0x1.0000020000000p+0 c: 0x1.0000040000040p+0 d:    0x0p+0 
i:24 a: 0x1.0000010000000p+0 c: 0x1.0000020000010p+0 d:    0x0p+0 
i:25 a: 0x1.0000008000000p+0 c: 0x1.0000010000004p+0 d:    0x0p+0 
i:26 a: 0x1.0000004000000p+0 c: 0x1.0000008000001p+0 d:    0x0p+0 
i:27 a: 0x1.0000002000000p+0 c: 0x1.0000004000000p+0 d:   0x1p-54 
i:28 a: 0x1.0000001000000p+0 c: 0x1.0000002000000p+0 d:   0x1p-56 
i:29 a: 0x1.0000000800000p+0 c: 0x1.0000001000000p+0 d:   0x1p-58 
i:30 a: 0x1.0000000400000p+0 c: 0x1.0000000800000p+0 d:   0x1p-60 
i:31 a: 0x1.0000000200000p+0 c: 0x1.0000000400000p+0 d:   0x1p-62 
i:32 a: 0x1.0000000100000p+0 c: 0x1.0000000200000p+0 d:    0x0p+0 ?
i:33 a: 0x1.0000000080000p+0 c: 0x1.0000000100000p+0 d:    0x0p+0 ?
i:34 a: 0x1.0000000040000p+0 c: 0x1.0000000080000p+0 d:    0x0p+0 ?
...
i:51 a: 0x1.0000000000002p+0 c: 0x1.0000000000004p+0 d:    0x0p+0 ?
i:52 a: 0x1.0000000000001p+0 c: 0x1.0000000000002p+0 d:    0x0p+0 ?
i:53 a: 0x1.0000000000000p+0 c: 0x1.0000000000000p+0 d:    0x0p+0 
i:54 a: 0x1.0000000000000p+0 c: 0x1.0000000000000p+0 d:    0x0p+0 

版本信息

gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-pc-cygwin/5.3.0/lto-wrapper.exe
Target: i686-pc-cygwin
Configured with: /cygdrive/i/szsz/tmpp/gcc/gcc-5.3.0-5.i686/src/gcc-5.3.0/configure --srcdir=/cygdrive/i/szsz/tmpp/gcc/gcc-5.3.0-5.i686/src/gcc-5.3.0 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=i686-pc-cygwin --host=i686-pc-cygwin --target=i686-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --libexecdir=/usr/lib --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --enable-__cxa_atexit --with-dwarf2 --with-arch=i686 --with-tune=generic --disable-sjlj-exceptions --enable-languages=ada,c,c++,fortran,java,lto,objc,obj-c++ --enable-graphite --enable-threads=posix --enable-libatomic --enable-libcilkrts --enable-libgomp --enable-libitm --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-libada --enable-libjava --enable-libgcj-sublibs --disable-java-awt --disable-symvers --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --enable-linker-build-id --with-default-libstdcxx-abi=gcc4-compatible
Thread model: posix
gcc version 5.3.0 (GCC) 

推荐答案

是 Cygwin 的错.或者更准确地说,它使用的 newlib C 库.它 明确表示它甚至没有尝试让 fma() 仿真正确.

It is Cygwin's fault. Or more correctly, the newlib C library it uses. It explicitly says it does not even try to get fma() emulation correct.

自 2015 年以来,GNU C 库对几乎所有 fma 变体都有正确的仿真.有关详细信息以及用于实现此功能的补丁,请参阅源软件错误 13304.

GNU C library has correct emulation for almost all fma variants since 2015. For details, and the patches used to implement this, see the sourceware bug 13304.

如果效率不是问题,那么我会简单地使用例如

If efficiency is not a problem, then I would simply use e.g.

#if defined(__CYGWIN__) && !defined(__FMA__) && !defined(__FMA3__) && !defined(__FMA4__)
#define fma(x, y, z)  fma_emulation(x, y, z)

double fma_emulation(double x, double y, double z)
{
    /* One of the implementations linked above */
}
#endif

我个人根本不使用 Windows,但如果有人使用(使用 Windows 并需要 fma 仿真),我建议他们尝试向上游推送补丁,并带有指向 关于正确 fma 仿真的 GNU C 库讨论.

I do not personally use Windows at all, but if anyone does (use Windows and need the fma emulation), I'd suggest they try and push a patch upstream, with a link to the GNU C library discussion on correct fma emulation.

我想知道的是,是否可以仅检查结果的低 M 位(舍入中丢弃)以确定结果中 ULP 的正确值,以及使用 nextafter() 相应地调整使用直接 a×b+c 操作获得的结果;而不是使用多精度算术来实现整个操作.

What I am wondering, is whether it would be possible to examine just the low M bits of the result (discarded in the rounding) to determine the correct value of the ULP in the result, and adjust the result obtained using the straightforward a×b+c operation accordingly, using nextafter(); rather than using multiprecision arithmetic to implement the entire operation.

不,因为加法可能会溢出,丢弃一个额外的位作为丢弃部分的 MSB.仅出于这个原因,我们确实需要完成整个操作.另一个原因是,如果 a×bc 有不同的符号,那么我们不是加法,而是从较大的量级中减去较小的量级(结果使用较大的符号),这可能会导致清除较大的几个高位,并影响整个结果的哪些位在舍入中被丢弃.

No, because the addition may overflow, dropping an extra bit as the MSB of the discarded part. For that reason alone, we do need to do the entire operation. Another reason is that if a×b and c have different signs, then instead of addition, we substract smaller in magnitude from larger in magnitude (result using the larger's sign), which may lead to clearing several high bits in the larger, and that affects which bits of the entire result are dropped in the rounding.

但是,对于 x86 和 x86-64 架构上的 IEEE-754 Binary64 double,我相信使用 64 位(整数)寄存器和 128 位乘积的 fma 仿真仍然是相当可行的.我将试验一个表示,其中 64 位寄存器中的低 2 位用于舍入决策位(LSB 是所有丢弃位的逻辑或),53 位用于尾数,一个进位位,剩下 8未使用和忽略的高位.当无符号整数尾数转换为(64 位)双精度时执行舍入.如果这些实验产生任何有用的东西,我会在这里描述它们.

However, for the IEEE-754 Binary64 double on x86 and x86-64 architectures, I do believe fma emulation using 64-bit (integer) registers and 128-bit product, is still quite feasible. I shall experiment on a representation where low 2 bits in a 64-bit register are used for the rounding decision bits (LSB being a logical OR of all the dropped bits), 53 bits used for the mantissa, and one carry bit, leaving 8 unused and ignored high bits. The rounding is performed when the unsigned integer mantissa is converted to a (64-bit) double. If these experiments yield anything useful, I shall describe them here.

初步发现:fma() 在 32 位系统上的仿真速度很慢.387 FPU 上的 80 位东西在这里基本上没用,在 32 位系统上实现 53×53 位乘法(和位移)只是......不值得努力.在我看来,上面链接的 glibc fma() 仿真代码已经足够好了.

Initial findings: fma() emulation on a 32-bit system is slow. The 80-bit stuff on the 387 FPU is basically useless here, and implementing the 53×53-bit multiplication (and bit shifting) on a 32-bit system is just .. not worth the effort. The glibc fma() emulation code linked to above is good enough in my opinion.

其他发现:处理非有限值是讨厌.(次正规数只是有点烦人,需要特殊处理(因为尾数中的隐式 MSB 为零).)如果三个参数中的任何一个是非有限的(无穷大或某种形式的 NaN),则返回 a*b+ c (未融合)是唯一明智的选择.处理这些情况需要额外的分支,这会减慢仿真速度.

Additional findings: Handling non-finite values is nasty. (Subnormals are only slightly annoying, requiring special handling (as the implicit MSB in mantissa is zero then).) If any of the three arguments is non-finite (infinity or some form of NaN), then returning a*b + c (not fused) is the only sane option. Handling these cases require additional branching, which slows down the emulation.

最终决定:以优化方式处理的案例数量(而不是使用 glibc 仿真中使用的多精度肢体"方法)足以使这种方法不值得努力.如果每个肢体是 64 位,则 abc 中的每一个都最多分布在 2 个肢体上,并且 a×b 超过三肢.(对于 32 位肢体,分别只有 3 个和 5 个肢体.)取决于 a×bc 是否具有相同的或不同的符号,只有两种根本不同的情况需要处理——在不同符号的情况下,加法变成减法(从大到小,结果与较大的值得到相同的符号).

Final decision: The number of cases to handle in an optimized manner (rather than using the multiprecision "limb" approach as used in the glibc emulation) is large enough to make this approach not worth the effort. If each limb is 64-bit, each of a, b, and c is spread over at most 2 limbs, and a×b over three limbs. (With 32-bit limbs, that is just 3 and 5 limbs, respectively.) Depending on whether a×b and c have the same or different signs, there are only two fundamentally different cases to handle -- in the different signs case, the addition turns into subtraction (smaller from larger, result getting the same sign as the larger value).

简而言之,多精度方法更好.所需的实际精度非常有限,甚至不需要动态分配.如果 ab 的尾数乘积可以高效计算,那么多精度部分仅限于保持乘积和处理加法/减法.最终舍入可以通过将结果转换为 53 位尾数、指数和两个额外的低位来完成(较高的是舍入中丢失的最高有效位,而较低的是舍入中丢失的其余位的 OR四舍五入).本质上,关键操作可以使用整数(或 SSE/AVX 寄存器)完成,从 55 位尾数到 double 的最终转换根据当前规则处理舍入.

In short, the multiprecision approach is better. The actual precision needed is very well bounded, and does not need even dynamic allocation. If the product of the mantissas of a and b can be calculated efficiently, the multiprecision part is limited to holding the product and handling the addition/subtraction. Final rounding can be done by converting the result to a 53-bit mantissa, exponent, and two extra low bits (the higher being the most significant bit lost in the rounding, and the lower being an OR of the rest of the bits lost in the rounding). Essentially, the key operations can be done using integers (or SSE/AVX registers), and the final conversion from a 55-bit mantissa to double handles the rounding according to current rules.

这篇关于我的 fma() 坏了吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆