Is there a document describing how Clang handles excess floating-point precision?


Question


It is nearly impossible(*) to provide strict IEEE 754 semantics at reasonable cost when the only floating-point instructions one is allowed to use are the 387 ones. It is particularly hard when one wishes to keep the FPU working on the full 64-bit significand, so that the long double type is available for extended precision. The usual "solution" is to do intermediate computations at the only available precision, and to convert to the lower precision at more or less well-defined occasions.

Recent versions of GCC handle excess precision in intermediate computations according to the interpretation laid out by Joseph S. Myers in a 2008 post to the GCC mailing list. This description makes a program compiled with gcc -std=c99 -mno-sse2 -mfpmath=387 completely predictable, to the last bit, as far as I understand. And if by chance it doesn't, it is a bug and it will be fixed: Joseph S. Myers' stated intention in his post is to make it predictable.

Is it documented how Clang handles excess precision (say when the option -mno-sse2 is used), and where?

(*) EDIT: this is an exaggeration. It is slightly annoying but not that difficult to emulate binary64 when one is allowed to configure the x87 FPU to use a 53-bit significand.


Following a comment by R.. below, here is the log of a short interaction of mine with the most recent version of Clang I have:

Hexa:~ $ clang -v
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix
Hexa:~ $ cat fem.c
#include <stdio.h>
#include <math.h>
#include <float.h>
#include <fenv.h>

double x;
double y = 2.0;
double z = 1.0;

int main(){
  x = y + z;
  printf("%d\n", (int) FLT_EVAL_METHOD);
}
Hexa:~ $ clang -std=c99 -mno-sse2 fem.c
Hexa:~ $ ./a.out 
0
Hexa:~ $ clang -std=c99 -mno-sse2 -S fem.c
Hexa:~ $ cat fem.s 
…
    movl    $0, %esi
    fldl    _y(%rip)
    fldl    _z(%rip)
    faddp   %st(1)
    movq    _x@GOTPCREL(%rip), %rax
    fstpl   (%rax)
…

Solution

This does not answer the originally posed question, but if you are a programmer working with similar issues, this answer might help you.

I really don't see where the perceived difficulty is. Providing strict IEEE-754 binary64 semantics while being limited to 80387 floating-point math, and retaining 80-bit long double computation, seems to follow directly from the well-specified C99 casting rules with both GCC-4.6.3 and clang-3.0 (based on LLVM 3.0).

Edited to add: Yet, Pascal Cuoq is correct: neither gcc-4.6.3 nor clang-llvm-3.0 actually enforces those rules correctly for '387 floating-point math. Given the proper compiler options, the rules are correctly applied to expressions evaluated at compile time, but not to run-time expressions. There are workarounds, listed after the break below.

I write molecular dynamics simulation code, and am very familiar with the repeatability/predictability requirements, and also with the desire to retain the maximum precision available when possible, so I do claim I know what I am talking about here. This answer should show that the tools exist and are simple to use; the problems arise from not being aware of, or not using, those tools.

(A preferred example I like, is the Kahan summation algorithm. With C99 and proper casting (adding casts to e.g. Wikipedia example code), no tricks or extra temporary variables are needed at all. The implementation works regardless of compiler optimization level, including at -O3 and -Ofast.)
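The Kahan approach mentioned above can be sketched as follows (a minimal sketch, not the answerer's exact code; the casts are the only change relative to the textbook version):

```c
#include <assert.h>
#include <stddef.h>

/* Kahan compensated summation. Each (double) cast discards any excess
   x87 precision per C99, so the compensation term c is computed in true
   binary64 regardless of FLT_EVAL_METHOD or optimization level. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = (double)(x[i] - c);       /* compensated input      */
        double t = (double)(sum + y);        /* new running sum        */
        c = (double)((double)(t - sum) - y); /* lost low-order bits    */
        sum = t;
    }
    return sum;
}
```

No extra temporaries or volatile tricks are needed; the casts alone pin each intermediate to binary64.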

C99 explicitly states (in e.g. 5.2.4.2.2) that casting and assignment both remove all extra range and precision. This means that you can use long double arithmetic by defining the temporary variables used during computation as long double, and casting your input variables to that type; whenever an IEEE-754 binary64 is needed, just cast to double.
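As a concrete illustration of this pattern (a sketch of mine, not from the original answer): a dot product can accumulate in an 80-bit temporary and round once, at the end, to binary64.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate at long double (80-bit on '387) precision; the final cast
   to double performs a single, well-defined rounding to IEEE-754 binary64. */
double dot(const double *x, const double *y, size_t n)
{
    long double acc = 0.0L;
    for (size_t i = 0; i < n; i++)
        acc += (long double)x[i] * (long double)y[i];
    return (double)acc;
}
```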

On '387, the cast generates an assignment and a load on both the above compilers; this does correctly round the 80-bit value to IEEE-754 binary64. This cost is very reasonable in my opinion. The exact time taken depends on the architecture and the surrounding code; usually it is small, and can be interleaved with other code to bring the cost down to negligible levels. When MMX, SSE or AVX is available, their registers are separate from the 80-bit 80387 registers, and the cast is usually done by moving the value to an MMX/SSE/AVX register.

(I prefer production code to use a specific floating-point type, say tempdouble or such, for temporary variables, so that it can be defined to either double or long double depending on architecture and speed/precision tradeoffs desired.)

In a nutshell:

Don't assume (expression) is of double precision just because all the variables and literal constants are. Write it as (double)(expression) if you want the result at double precision.

This applies to compound expressions, too, and may sometimes lead to unwieldy expressions with many levels of casts.

If you have expr1 and expr2 that you wish to compute at 80-bit precision, but need their product computed with each factor first rounded to 64 bits, use

long double  expr1;
long double  expr2;
double       product = (double)(expr1) * (double)(expr2);

Note, product is computed as a product of two 64-bit values; not computed at 80-bit precision, then rounded down. Calculating the product at 80-bit precision, then rounding down, would be

double       other = expr1 * expr2;

or, adding descriptive casts that tell you exactly what is happening,

double       other = (double)((long double)(expr1) * (long double)(expr2));

It should be obvious that product and other often differ.

The C99 casting rules are just another tool you must learn to wield, if you do work with mixed 32-bit/64-bit/80-bit/128-bit floating point values. Really, you encounter the exact same issues if you mix binary32 and binary64 floats (float and double on most architectures)!
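For instance, the binary32/binary64 analogue of the pitfalls above (a minimal sketch; the function name is mine):

```c
#include <assert.h>

/* Comparing a float against a double: the float operand is promoted to
   double, and the binary32 and binary64 representations of "0.1" differ,
   so the comparison is false on IEEE-754 systems. */
int same_tenth(void)
{
    float  f = 1.0f / 10.0f;  /* 0.1 rounded to binary32 */
    double d = 1.0  / 10.0;   /* 0.1 rounded to binary64 */
    return f == d;            /* 0: promoted f != d      */
}
```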

Perhaps rewriting Pascal Cuoq's exploration code, to correctly apply casting rules, makes this clearer?

#include <stdio.h>

#define TEST(eq) printf("%-56s%s\n", "" # eq ":", (eq) ? "true" : "false")

int main(void)
{
    double d = 1.0 / 10.0;
    long double ld = 1.0L / 10.0L;

    printf("sizeof (double) = %d\n", (int)sizeof (double));
    printf("sizeof (long double) == %d\n", (int)sizeof (long double));

    printf("\nExpect true:\n");
    TEST(d == (double)(0.1));
    TEST(ld == (long double)(0.1L));
    TEST(d == (double)(1.0 / 10.0));
    TEST(ld == (long double)(1.0L / 10.0L));
    TEST(d == (double)(ld));
    TEST((double)(1.0L/10.0L) == (double)(0.1));
    TEST((long double)(1.0L/10.0L) == (long double)(0.1L));

    printf("\nExpect false:\n");
    TEST(d == ld);
    TEST((long double)(d) == ld);
    TEST(d == 0.1L);
    TEST(ld == 0.1);
    TEST(d == (long double)(1.0L / 10.0L));
    TEST(ld == (double)(1.0L / 10.0));

    return 0;
}

The output, with both GCC and clang, is

sizeof (double) = 8
sizeof (long double) == 12

Expect true:
d == (double)(0.1):                                     true
ld == (long double)(0.1L):                              true
d == (double)(1.0 / 10.0):                              true
ld == (long double)(1.0L / 10.0L):                      true
d == (double)(ld):                                      true
(double)(1.0L/10.0L) == (double)(0.1):                  true
(long double)(1.0L/10.0L) == (long double)(0.1L):       true

Expect false:
d == ld:                                                false
(long double)(d) == ld:                                 false
d == 0.1L:                                              false
ld == 0.1:                                              false
d == (long double)(1.0L / 10.0L):                       false
ld == (double)(1.0L / 10.0):                            false

except that recent versions of GCC promote the right-hand side of ld == 0.1 to long double first (i.e. to ld == 0.1L), yielding true, and that with SSE/AVX, long double is 128 bits in size.

For the pure '387 tests, I used

gcc -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
clang -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test

with various optimization flag combinations as ..., including -fomit-frame-pointer, -O0, -O1, -O2, -O3, and -Os.

Using any other flags or C99 compilers should lead to the same results, except for the long double size (and ld == 0.1 for current GCC versions). If you encounter any differences, I'd be very grateful to hear about them; I may need to warn my users of such compilers (compiler versions). Note that Microsoft does not support C99, so they are completely uninteresting to me.


Pascal Cuoq does bring up an interesting problem in the comment chain below, which I didn't immediately recognize.

With -mfpmath=387, both GCC and clang evaluate all expressions and intermediate results at 80-bit precision. This leads to, for example,

7491907632491941888 = 0x1.9fe2693112e14p+62 = 110011111111000100110100100110001000100101110000101000000000000
5698883734965350400 = 0x1.3c5a02407b71cp+62 = 100111100010110100000001001000000011110110111000111000000000000

7491907632491941888 * 5698883734965350400 = 42695510550671093541385598890357555200 = 100000000111101101101100110001101000010100100001011110111111111111110011000111000001011101010101100011000000000000000000000000

yielding incorrect results, because that string of ones in the middle of the binary result is just at the difference between 53- and 64-bit mantissas (64 and 80-bit floating point numbers, respectively). So, while the expected result is

42695510550671088819251326462451515392 = 0x1.00f6d98d0a42fp+125 = 100000000111101101101100110001101000010100100001011110000000000000000000000000000000000000000000000000000000000000000000000000

the result obtained with just -std=c99 -m32 -mno-sse -mfpmath=387 is

42695510550671098263984292201741942784 = 0x1.00f6d98d0a43p+125 = 100000000111101101101100110001101000010100100001100000000000000000000000000000000000000000000000000000000000000000000000000000

In theory, you should be able to tell gcc and clang to enforce the correct C99 rounding rules by using options

-std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard

However, this only affects expressions the compiler optimizes, and does not seem to fix the 387 handling at all. If you use e.g. clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard test.c -o test && ./test with test.c being Pascal Cuoq's example program, you will get the correct result per IEEE-754 rules -- but only because the compiler optimizes away the expression, not using the 387 at all.

Simply put, instead of computing

(double)d1 * (double)d2

both gcc and clang actually tell the '387 to compute

(double)((long double)d1 * (long double)d2)

This is, I believe, a compiler bug affecting both gcc-4.6.3 and clang-llvm-3.0, and an easily reproduced one. (Pascal Cuoq points out that FLT_EVAL_METHOD=2 means operations on double-precision arguments are always done at extended precision, but I cannot see any sane reason -- aside from having to rewrite parts of libm on '387 -- to do that in C99, considering that the IEEE-754 rules are achievable by the hardware! After all, the correct operation is easily achievable by the compiler, by modifying the '387 control word to match the precision of the expression. And, given that the compiler options that should force this behaviour -- -std=c99 -ffloat-store -fexcess-precision=standard -- make no sense if FLT_EVAL_METHOD=2 behaviour is actually desired, there are no backwards-compatibility issues, either.) It is important to note that, given the proper compiler flags, expressions evaluated at compile time do get evaluated correctly, and that only expressions evaluated at run time get incorrect results.

The simplest workaround, and the portable one, is to use fesetround(FE_TOWARDZERO) (from fenv.h) to round all results towards zero.

In some cases, rounding towards zero may help with predictability and pathological cases. In particular, for intervals like x = [0,1), rounding towards zero means the upper limit is never reached through rounding; this is important if you evaluate e.g. piecewise splines.

For the other rounding modes, you need to control the 387 hardware directly.

You can use either _FPU_SETCW() from #include <fpu_control.h>, or open-code it. For example, precision.c:

#include <stdlib.h>
#include <stdio.h>
#include <limits.h>

#define FP387_NEAREST   0x0000
#define FP387_ZERO      0x0C00
#define FP387_UP        0x0800
#define FP387_DOWN      0x0400

#define FP387_SINGLE    0x0000
#define FP387_DOUBLE    0x0200
#define FP387_EXTENDED  0x0300

static inline void fp387(const unsigned short control)
{
    unsigned short cw = (control & 0x0F00) | 0x007f;
    __asm__ volatile ("fldcw %0" : : "m" (*&cw));
}

const char *bits(const double value)
{
    const unsigned char *const data = (const unsigned char *)&value;
    static char buffer[CHAR_BIT * sizeof value + 1];
    char       *p = buffer;
    size_t      i = CHAR_BIT * sizeof value;
    while (i-->0)
        *(p++) = '0' + !!(data[i / CHAR_BIT] & (1U << (i % CHAR_BIT)));
    *p = '\0';
    return (const char *)buffer;
}


int main(int argc, char *argv[])
{
    double  d1, d2;
    char    dummy;

    if (argc != 3) {
        fprintf(stderr, "\nUsage: %s 7491907632491941888 5698883734965350400\n\n", argv[0]);
        return EXIT_FAILURE;
    }

    if (sscanf(argv[1], " %lf %c", &d1, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[2], " %lf %c", &d2, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[2]);
        return EXIT_FAILURE;
    }

    printf("%s:\td1 = %.0f\n\t    %s in binary\n", argv[1], d1, bits(d1));
    printf("%s:\td2 = %.0f\n\t    %s in binary\n", argv[2], d2, bits(d2));

    printf("\nDefaults:\n");
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to nearest integer:\n");
    fp387(FP387_EXTENDED | FP387_NEAREST);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to nearest integer:\n");
    fp387(FP387_DOUBLE | FP387_NEAREST);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to zero:\n");
    fp387(FP387_EXTENDED | FP387_ZERO);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to zero:\n");
    fp387(FP387_DOUBLE | FP387_ZERO);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    return 0;
}

Using clang-llvm-3.0 to compile and run, I get the correct results,

clang -std=c99 -m32 -mno-sse -mfpmath=387 -O3 -W -Wall precision.c -o precision
./precision 7491907632491941888 5698883734965350400

7491907632491941888:    d1 = 7491907632491941888
        0100001111011001111111100010011010010011000100010010111000010100 in binary
5698883734965350400:    d2 = 5698883734965350400
        0100001111010011110001011010000000100100000001111011011100011100 in binary

Defaults:
Product = 42695510550671098263984292201741942784
          0100011111000000000011110110110110011000110100001010010000110000 in binary

Extended precision, rounding to nearest integer:
Product = 42695510550671098263984292201741942784
          0100011111000000000011110110110110011000110100001010010000110000 in binary

Double precision, rounding to nearest integer:
Product = 42695510550671088819251326462451515392
          0100011111000000000011110110110110011000110100001010010000101111 in binary

Extended precision, rounding to zero:
Product = 42695510550671088819251326462451515392
          0100011111000000000011110110110110011000110100001010010000101111 in binary

Double precision, rounding to zero:
Product = 42695510550671088819251326462451515392
          0100011111000000000011110110110110011000110100001010010000101111 in binary

In other words, you can work around the compiler issues by using fp387() to set the precision and rounding mode.

The downside is that some math libraries (libm.a, libm.so) may be written with the assumption that intermediate results are always computed at 80-bit precision. At least the GNU C library fpu_control.h on x86_64 has the comment "libm requires extended precision". Fortunately, you can take the '387 implementations from e.g. GNU C library, and implement them in a header file or write a known-to-work libm, if you need the math.h functionality; in fact, I think I might be able to help there.
