Is there a document describing how Clang handles excess floating-point precision?

Problem Description

It is nearly impossible(*) to provide strict IEEE 754 semantics at reasonable cost when the only floating-point instructions one is allowed to use are the 387 ones. It is particularly hard when one wishes to keep the FPU working on the full 64-bit significand so that the long double type is available for extended precision. The usual "solution" is to do intermediate computations at the only available precision, and to convert to the lower precision on more or less well-defined occasions.

Recent versions of GCC handle excess precision in intermediate computations according to the interpretation laid out by Joseph S. Myers in a 2008 post to the GCC mailing list. This description makes a program compiled with gcc -std=c99 -mno-sse2 -mfpmath=387 completely predictable, to the last bit, as far as I understand. And if by chance it doesn't, it is a bug and it will be fixed: Joseph S. Myers' stated intention in his post is to make it predictable.

Is it documented how Clang handles excess precision (say when the option -mno-sse2 is used), and where?

(*) EDIT: this is an exaggeration. It is slightly annoying but not that difficult to emulate binary64 when one is allowed to configure the x87 FPU to use a 53-bit significand.


Following a comment by R.. below, here is the log of a short interaction of mine with the most recent version of Clang I have:

Hexa:~ $ clang -v
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix
Hexa:~ $ cat fem.c
#include <stdio.h>
#include <math.h>
#include <float.h>
#include <fenv.h>

double x;
double y = 2.0;
double z = 1.0;

int main(){
  x = y + z;
  printf("%d\n", (int) FLT_EVAL_METHOD);
}
Hexa:~ $ clang -std=c99 -mno-sse2 fem.c
Hexa:~ $ ./a.out 
0
Hexa:~ $ clang -std=c99 -mno-sse2 -S fem.c
Hexa:~ $ cat fem.s 
…
    movl    $0, %esi
    fldl    _y(%rip)
    fldl    _z(%rip)
    faddp   %st(1)
    movq    _x@GOTPCREL(%rip), %rax
    fstpl   (%rax)
…

Solution

This does not answer the originally posed question, but if you are a programmer working with similar issues, this answer might help you.

I really don't see where the perceived difficulty is. Providing strict IEEE-754 binary64 semantics while being limited to 80387 floating-point math, and retaining 80-bit long double computation, seems to follow well-specified C99 casting rules with both GCC-4.6.3 and clang-3.0 (based on LLVM 3.0).

Edited to add: Yet, Pascal Cuoq is correct: neither gcc-4.6.3 nor clang-llvm-3.0 actually enforces those rules correctly for '387 floating-point math. Given the proper compiler options, the rules are correctly applied to expressions evaluated at compile time, but not to run-time expressions. There are workarounds, listed after the break below.

I do molecular dynamics simulation code, and am very familiar with the repeatability/predictability requirements and also with the desire to retain maximum precision available when possible, so I do claim I know what I am talking about here. This answer should show that the tools exist and are simple to use; the problems arise from not being aware of or not using those tools.

(A favorite example of mine is the Kahan summation algorithm. With C99 and proper casting (adding casts to e.g. the Wikipedia example code), no tricks or extra temporary variables are needed at all; a sketch follows below. The implementation works regardless of compiler optimization level, including at -O3 and -Ofast.)
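
For concreteness, here is a minimal sketch of that idea -- the textbook Kahan summation loop with the casts written out explicitly. The function name kahan_sum and the array-plus-length interface are illustrative choices of mine, not something from the original answer.

#include <stddef.h>

/* Kahan compensated summation. Each (double) cast discards any excess
   precision per C99, so every intermediate is a genuine IEEE-754 binary64
   value even when FLT_EVAL_METHOD is 2 (the '387 case). */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c   = 0.0;                    /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = (double)(x[i] - c);
        double t = (double)(sum + y);
        c   = (double)((t - sum) - y);   /* recovers the part of y rounded away in t */
        sum = t;
    }
    return sum;
}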

C99 explicitly states (in e.g. 5.2.4.2.2) that casting and assignment both remove all extra range and precision. This means that you can use long double arithmetic by defining the temporary variables used during computation as long double, also casting your input variables to that type; whenever an IEEE-754 binary64 is needed, just cast to a double.

On '387, the cast generates an assignment and a load on both the above compilers; this correctly rounds the 80-bit value to IEEE-754 binary64. This cost is very reasonable in my opinion. The exact time taken depends on the architecture and the surrounding code; usually it can be interleaved with other code to bring the cost down to negligible levels. When MMX, SSE or AVX are available, their registers are separate from the 80-bit 80387 registers, and the cast is usually done by moving the value to an MMX/SSE/AVX register.
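
As a minimal sketch of this pattern (the helper name dot_binary64 is hypothetical, chosen only for illustration): keep the temporaries in long double, and cast to double exactly once, where the binary64 result is required.

#include <stddef.h>

/* Dot product accumulated in long double (80-bit on '387), with a single
   final cast that rounds the result to IEEE-754 binary64. */
double dot_binary64(const double *a, const double *b, size_t n)
{
    long double acc = 0.0L;
    for (size_t i = 0; i < n; i++)
        acc += (long double)a[i] * (long double)b[i];
    return (double)acc;   /* the cast removes the extra range and precision */
}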

(I prefer production code to use a specific floating-point type, say tempdouble or such, for temporary variables, so that it can be defined to either double or long double depending on architecture and speed/precision tradeoffs desired.)

In a nutshell:

Don't assume (expression) is of double precision just because all the variables and literal constants are. Write it as (double)(expression) if you want the result at double precision.

This applies to compound expressions, too, and may sometimes lead to unwieldy expressions with many levels of casts.

If you have expr1 and expr2 that you wish to compute at 80-bit precision, but also need their product with each operand first rounded to 64 bits, use

long double  expr1;
long double  expr2;
double       product = (double)(expr1) * (double)(expr2);

Note that product is computed as the product of two 64-bit values; it is not computed at 80-bit precision and then rounded down. Calculating the product at 80-bit precision and then rounding down would be

double       other = expr1 * expr2;

or, adding descriptive casts that tell you exactly what is happening,

double       other = (double)((long double)(expr1) * (long double)(expr2));

It should be obvious that product and other often differ.

The C99 casting rules are just another tool you must learn to wield, if you do work with mixed 32-bit/64-bit/80-bit/128-bit floating point values. Really, you encounter the exact same issues if you mix binary32 and binary64 floats (float and double on most architectures)!
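
If it helps, here is a tiny binary32/binary64 analogue of the tests below; it is my own illustration rather than part of the original answer, but the same casting rules are at work.

#include <stdio.h>

int main(void)
{
    float  f = 0.1f;   /* nearest binary32 to 0.1 */
    double d = 0.1;    /* nearest binary64 to 0.1 */

    printf("f == d         : %s\n", (f == d)         ? "true" : "false"); /* false */
    printf("f == (float)d  : %s\n", (f == (float)d)  ? "true" : "false"); /* true  */
    printf("(double)f == d : %s\n", ((double)f == d) ? "true" : "false"); /* false */
    return 0;
}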

Perhaps rewriting Pascal Cuoq's exploration code, to correctly apply casting rules, makes this clearer?

#include <stdio.h>

#define TEST(eq) printf("%-56s%s\n", "" # eq ":", (eq) ? "true" : "false")

int main(void)
{
    double d = 1.0 / 10.0;
    long double ld = 1.0L / 10.0L;

    printf("sizeof (double) = %d\n", (int)sizeof (double));
    printf("sizeof (long double) == %d\n", (int)sizeof (long double));

    printf("\nExpect true:\n");
    TEST(d == (double)(0.1));
    TEST(ld == (long double)(0.1L));
    TEST(d == (double)(1.0 / 10.0));
    TEST(ld == (long double)(1.0L / 10.0L));
    TEST(d == (double)(ld));
    TEST((double)(1.0L/10.0L) == (double)(0.1));
    TEST((long double)(1.0L/10.0L) == (long double)(0.1L));

    printf("\nExpect false:\n");
    TEST(d == ld);
    TEST((long double)(d) == ld);
    TEST(d == 0.1L);
    TEST(ld == 0.1);
    TEST(d == (long double)(1.0L / 10.0L));
    TEST(ld == (double)(1.0L / 10.0));

    return 0;
}

The output, with both GCC and clang, is

sizeof (double) = 8
sizeof (long double) == 12

Expect true:
d == (double)(0.1):                                     true
ld == (long double)(0.1L):                              true
d == (double)(1.0 / 10.0):                              true
ld == (long double)(1.0L / 10.0L):                      true
d == (double)(ld):                                      true
(double)(1.0L/10.0L) == (double)(0.1):                  true
(long double)(1.0L/10.0L) == (long double)(0.1L):       true

Expect false:
d == ld:                                                false
(long double)(d) == ld:                                 false
d == 0.1L:                                              false
ld == 0.1:                                              false
d == (long double)(1.0L / 10.0L):                       false
ld == (double)(1.0L / 10.0):                            false

except that recent versions of GCC promote the right hand side of ld == 0.1 to long double first (i.e. to ld == 0.1L), yielding true, and that with SSE/AVX, long double is 128-bit.

For the pure '387 tests, I used

gcc -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
clang -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test

with various optimization flag combinations as ..., including -fomit-frame-pointer, -O0, -O1, -O2, -O3, and -Os.

Using any other flags or C99 compilers should lead to the same results, except for the long double size (and ld == 0.1 for current GCC versions). If you encounter any differences, I'd be very grateful to hear about them; I may need to warn my users of such compilers (compiler versions). Note that Microsoft does not support C99, so they are completely uninteresting to me.


Pascal Cuoq does bring up an interesting problem in the comment chain below, which I didn't immediately recognize.

When evaluating an expression, both GCC and clang with -mfpmath=387 specify that all expressions are evaluated using 80-bit precision. This leads, for example, to

7491907632491941888 = 0x1.9fe2693112e14p+62 = 110011111111000100110100100110001000100101110000101000000000000
5698883734965350400 = 0x1.3c5a02407b71cp+62 = 100111100010110100000001001000000011110110111000111000000000000

7491907632491941888 * 5698883734965350400 = 42695510550671093541385598890357555200 = 100000000111101101101100110001101000010100100001011110111111111111110011000111000001011101010101100011000000000000000000000000

yielding incorrect results, because that string of ones in the middle of the binary result is just at the difference between 53- and 64-bit mantissas (64 and 80-bit floating point numbers, respectively). So, while the expected result is

42695510550671088819251326462451515392 = 0x1.00f6d98d0a42fp+125 = 100000000111101101101100110001101000010100100001011110000000000000000000000000000000000000000000000000000000000000000000000000

the result obtained with just -std=c99 -m32 -mno-sse -mfpmath=387 is

42695510550671098263984292201741942784 = 0x1.00f6d98d0a43p+125 = 100000000111101101101100110001101000010100100001100000000000000000000000000000000000000000000000000000000000000000000000000000

In theory, you should be able to tell gcc and clang to enforce the correct C99 rounding rules by using options

-std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard

However, this only affects expressions the compiler optimizes, and does not seem to fix the 387 handling at all. If you use e.g. clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard test.c -o test && ./test with test.c being Pascal Cuoq's example program, you will get the correct result per IEEE-754 rules -- but only because the compiler optimizes away the expression, not using the 387 at all.

Simply put, instead of computing

(double)d1 * (double)d2

both gcc and clang actually tell the '387 to compute

(double)((long double)d1 * (long double)d2)

This is, I believe, a compiler bug affecting both gcc-4.6.3 and clang-llvm-3.0, and an easily reproduced one. (Pascal Cuoq points out that FLT_EVAL_METHOD=2 means operations on double-precision arguments are always done at extended precision, but I cannot see any sane reason -- aside from having to rewrite parts of libm on '387 -- to do that in C99, considering that the IEEE-754 rules are achievable by the hardware! After all, the correct operation is easily achievable by the compiler, by modifying the '387 control word to match the precision of the expression. And, given that the compiler options that should force this behaviour -- -std=c99 -ffloat-store -fexcess-precision=standard -- make no sense if the FLT_EVAL_METHOD=2 behaviour is actually desired, there are no backwards compatibility issues, either.) It is important to note that given the proper compiler flags, expressions evaluated at compile time do get evaluated correctly, and that only expressions evaluated at run time get incorrect results.

The simplest workaround, and the portable one, is to use fesetround(FE_TOWARDZERO) (from fenv.h) to round all results towards zero.

In some cases, rounding towards zero may help with predictability and pathological cases. In particular, for intervals like x = [0,1), rounding towards zero means the upper limit is never reached through rounding; important if you evaluate e.g. piecewise splines.
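
A minimal usage sketch (nothing beyond standard C99 fenv.h; linking may require -lm on some systems):

#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* Switch every subsequent floating-point result to round-towards-zero. */
    if (fesetround(FE_TOWARDZERO) != 0) {
        fprintf(stderr, "FE_TOWARDZERO not supported\n");
        return 1;
    }

    volatile double x = 1.0, y = 3.0;   /* volatile keeps the division at run time */
    printf("1/3 rounded towards zero: %.20g\n", x / y);
    return 0;
}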

For the other rounding modes, you need to control the 387 hardware directly.

You can use either __FPU_SETCW() from #include <fpu_control.h>, or open-code it. For example, precision.c:

#include <stdlib.h>
#include <stdio.h>
#include <limits.h>

#define FP387_NEAREST   0x0000
#define FP387_ZERO      0x0C00
#define FP387_UP        0x0800
#define FP387_DOWN      0x0400

#define FP387_SINGLE    0x0000
#define FP387_DOUBLE    0x0200
#define FP387_EXTENDED  0x0300

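/* Load the x87 control word: rounding and precision bits come from 'control';
   all floating-point exceptions are masked. */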
static inline void fp387(const unsigned short control)
{
    unsigned short cw = (control & 0x0F00) | 0x007f;
    __asm__ volatile ("fldcw %0" : : "m" (*&cw));
}

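/* Render the bits of a double as a '0'/'1' string, most significant bit first
   (little-endian x86 assumed). */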
const char *bits(const double value)
{
    const unsigned char *const data = (const unsigned char *)&value;
    static char buffer[CHAR_BIT * sizeof value + 1];
    char       *p = buffer;
    size_t      i = CHAR_BIT * sizeof value;
    while (i-->0)
        *(p++) = '0' + !!(data[i / CHAR_BIT] & (1U << (i % CHAR_BIT)));
    *p = '\0';
    return (const char *)buffer;
}


int main(int argc, char *argv[])
{
    double  d1, d2;
    char    dummy;

    if (argc != 3) {
        fprintf(stderr, "\nUsage: %s 7491907632491941888 5698883734965350400\n\n", argv[0]);
        return EXIT_FAILURE;
    }

    if (sscanf(argv[1], " %lf %c", &d1, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[2], " %lf %c", &d2, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[2]);
        return EXIT_FAILURE;
    }

    printf("%s:\td1 = %.0f\n\t    %s in binary\n", argv[1], d1, bits(d1));
    printf("%s:\td2 = %.0f\n\t    %s in binary\n", argv[2], d2, bits(d2));

    printf("\nDefaults:\n");
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to nearest integer:\n");
    fp387(FP387_EXTENDED | FP387_NEAREST);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to nearest integer:\n");
    fp387(FP387_DOUBLE | FP387_NEAREST);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to zero:\n");
    fp387(FP387_EXTENDED | FP387_ZERO);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to zero:\n");
    fp387(FP387_DOUBLE | FP387_ZERO);
    printf("Product = %.0f\n\t  %s in binary\n", d1 * d2, bits(d1 * d2));

    return 0;
}

Using clang-llvm-3.0 to compile and run, I get the correct results,

clang -std=c99 -m32 -mno-sse -mfpmath=387 -O3 -W -Wall precision.c -o precision
./precision 7491907632491941888 5698883734965350400

7491907632491941888:    d1 = 7491907632491941888
        0100001111011001111111100010011010010011000100010010111000010100 in binary
5698883734965350400:    d2 = 5698883734965350400
        0100001111010011110001011010000000100100000001111011011100011100 in binary

Defaults:
Product = 42695510550671098263984292201741942784
          0100011111000000000011110110110110011000110100001010010000110000 in binary

Extended precision, rounding to nearest integer:
Product = 42695510550671098263984292201741942784
          0100011111000000000011110110110110011000110100001010010000110000 in binary

Double precision, rounding to nearest integer:
Product = 42695510550671088819251326462451515392
          0100011111000000000011110110110110011000110100001010010000101111 in binary

Extended precision, rounding to zero:
Product = 42695510550671088819251326462451515392
          0100011111000000000011110110110110011000110100001010010000101111 in binary

Double precision, rounding to zero:
Product = 42695510550671088819251326462451515392
          0100011111000000000011110110110110011000110100001010010000101111 in binary

In other words, you can work around the compiler issues by using fp387() to set the precision and rounding mode.

The downside is that some math libraries (libm.a, libm.so) may be written with the assumption that intermediate results are always computed at 80-bit precision. At least the GNU C library fpu_control.h on x86_64 has the comment "libm requires extended precision". Fortunately, you can take the '387 implementations from e.g. the GNU C library and implement them in a header file, or write a known-to-work libm, if you need the math.h functionality; in fact, I think I might be able to help there.
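
If you do need to mix the double-precision control word with calls into such a libm, one hypothetical approach (my own sketch, not from the original answer, and only relevant with -mfpmath=387) is to save the caller's control word, switch to extended precision for the libm call, and restore it afterwards:

#include <math.h>

/* Run one libm call with the extended-precision control word it may expect,
   then restore whatever the caller had set (e.g. via fp387() above). */
static double exp_with_extended_precision(double x)
{
    unsigned short saved, extended;
    __asm__ volatile ("fnstcw %0" : "=m" (saved));   /* save caller's control word */
    extended = (unsigned short)(saved | 0x0300);     /* precision field: extended  */
    __asm__ volatile ("fldcw %0" : : "m" (extended));
    double r = exp(x);
    __asm__ volatile ("fldcw %0" : : "m" (saved));   /* restore caller's settings  */
    return r;
}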
