gcc 4.8 AVX optimization bug: extra code insertion?

Views: 241 | Published: 2018/4/18 20:33:11 | Tags: gcc, optimization, g++, sse, avx


Problem description


It is great that gcc 4.8 offers AVX optimization with the -Ofast option. However, I found an interesting but stupid bug: it adds extra computations that look unnecessary. Maybe I am wrong, so can someone give me an explanation?

The original C++ source code is as follows:
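
    #include <cmath>
    #include <iostream>
    using namespace std;

    #define N 1000007
    float a[N], b[N], c[N], d[N], e[N];

    int main(int argc, char *argv[]) {
        // Prints the addresses of the three arrays.
        cout << a << ' ' << b << ' ' << c << endl;
        for (int x = 0; x < N; ++x) {
            c[x] = 1 / sqrt((a[x] + b[x] - c[x]) * d[x] / e[x]);
        }
        return 0;
    }
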
The code is compiled using g++ 4.8.4 in Ubuntu 14.04.3 x86_64:

    g++ -mavx avx.cpp -masm=intel -c -g -Wa,-ahl=avx.asm -Ofast

The assembly source code is as follows:

      90                .LVL10:
      91 006b C5FC2825  vmovaps ymm4, YMMWORD PTR .LC0[rip]
      91      00000000
      92 0073 31C0      xor eax, eax
      93 0075 C5FC281D  vmovaps ymm3, YMMWORD PTR .LC1[rip]
      25:avx.cpp **** for(int x=0; x<N; ++x){
      26:avx.cpp ****     c[x] = 1/sqrt((a[x]+b[x]-c[x])*d[x]/e[x]);
     101                .loc 1 26 0 discriminator 2
     102 0080 C5FC2890  vmovaps ymm2, YMMWORD PTR b[rax]
     102      00000000
     103 0088 4883C020  add rax, 32
     104 008c C5FC2888  vmovaps ymm1, YMMWORD PTR e[rax-32]
     104      00000000
     105 0094 C5EC5890  vaddps ymm2, ymm2, YMMWORD PTR a[rax-32]
     105      00000000
     106 009c C5FC53C1  vrcpps ymm0, ymm1
     107 00a0 C5FC59C9  vmulps ymm1, ymm0, ymm1
     108 00a4 C5FC59C9  vmulps ymm1, ymm0, ymm1
     109 00a8 C5EC5C90  vsubps ymm2, ymm2, YMMWORD PTR c[rax-32]
     109      00000000
     110 00b0 C5FC58C0  vaddps ymm0, ymm0, ymm0
     111 00b4 C5EC5990  vmulps ymm2, ymm2, YMMWORD PTR d[rax-32]
     111      00000000
     112 00bc C5FC5CC9  vsubps ymm1, ymm0, ymm1
     113 00c0 C5EC59C1  vmulps ymm0, ymm2, ymm1
     118 00c4 C5FC52C8  vrsqrtps ymm1, ymm0
     119 00c8 C5F459C0  vmulps ymm0, ymm1, ymm0
     120 00cc C5FC59C1  vmulps ymm0, ymm0, ymm1
     121 00d0 C5F459CB  vmulps ymm1, ymm1, ymm3
     122 00d4 C5FC58C4  vaddps ymm0, ymm0, ymm4
     123 00d8 C5FC59C9  vmulps ymm1, ymm0, ymm1
     124                .LBE45:
     125                .LBE44:
     126                .loc 1 26 0 discriminator 2
     127 00dc C5FC2988  vmovaps YMMWORD PTR c[rax-32], ymm1
     127      00000000
     128 00e4 483D0009  cmp rax, 4000000
     128      3D00
     129 00ea 7594      jne .L3

Now look at lines 106, 107, 108, 110, 112 and 113.

The compiler computes the division by e[x] as a multiplication by its reciprocal. So line 106 computes 1/e[x], which is correct. After that, it could directly multiply this by the product (a[x]+b[x]-c[x])*d[x], which is stored in ymm2 (line 111). However, instead of doing this, the compiler did something interesting and ridiculous:

it first multiplies the computed reciprocal 1/e[x] by e[x] to obtain 1 (line 107)

then it multiplies this 1 by 1/e[x] to get back 1/e[x] (line 108)

then it adds 1/e[x] to itself to obtain 2/e[x] (line 110)

then it subtracts 1/e[x] from 2/e[x] to get back 1/e[x] (line 112)
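
As a reading aid, the arithmetic of listing lines 107 to 113 can be transcribed into scalar C++. This is only a sketch: the function and variable names are hypothetical, one float stands in for each 8-lane ymm register, and `r0` is whatever value vrcpps returned for `e` on line 106.

```cpp
#include <cassert>

// Scalar transcription of listing lines 107-113 (one lane of each ymm register).
// All names are hypothetical. `p` stands for (a[x]+b[x]-c[x])*d[x] held in ymm2,
// `e` for e[x], and `r0` for the value vrcpps produced from `e` on line 106.
float lines_107_to_113(float p, float e, float r0) {
    assert(e != 0.0f);         // the source expression divides by e[x]
    float ymm1 = r0 * e;       // line 107: exactly 1 only if r0 is exactly 1/e
    ymm1 = r0 * ymm1;          // line 108: e * r0 * r0
    float ymm0 = r0 + r0;      // line 110: 2 * r0
    ymm1 = ymm0 - ymm1;        // line 112: 2*r0 - e*r0*r0
    return p * ymm1;           // line 113: p times the value from line 112
}
```

For example, `lines_107_to_113(6.0f, 4.0f, 0.25f)` returns 1.5, that is 6/4, because 0.25 is exactly 1/4.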

After that, the compiler cleverly uses the vrsqrtps instruction to compute 1/sqrt(). But what happens next? Instead of taking the output in ymm1 (line 118) as it is, it does something even more fanciful:

it first multiplies 1/sqrt(x) by x to obtain sqrt(x) (line 119)

it then multiplies sqrt(x) by 1/sqrt(x) to get back 1 (line 120)

it then multiplies 1/sqrt(x) by 1 (pre-stored in ymm3) to obtain the same 1/sqrt(x) (line 121)

it then adds 0 (pre-stored in ymm4) to the obtained 1, which is still 1 (line 122)

it then multiplies 1/sqrt(x) by the obtained 1 to get back the same 1/sqrt(x) (line 123)
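
The arithmetic of listing lines 119 to 123 can likewise be transcribed into scalar C++ as a reading aid. This is a sketch with hypothetical names. The listing does not show what `.LC0` and `.LC1` contain, so the two constants are left as parameters rather than assumed.

```cpp
#include <cassert>

// Scalar transcription of listing lines 119-123 (one lane of each ymm register).
// All names are hypothetical. `x` is the value in ymm0 after line 113, `y0` is
// what vrsqrtps returned for it on line 118, `k3` is the constant loaded from
// .LC1 into ymm3 and `k4` the constant loaded from .LC0 into ymm4.
float lines_119_to_123(float x, float y0, float k3, float k4) {
    assert(x > 0.0f);          // a reciprocal square root needs a positive input
    float ymm0 = y0 * x;       // line 119: x * y0
    ymm0 = ymm0 * y0;          // line 120: x*y0*y0, exactly 1 only if y0 is exact
    float ymm1 = y0 * k3;      // line 121: y0 * k3
    ymm0 = ymm0 + k4;          // line 122: x*y0*y0 + k4
    return ymm0 * ymm1;        // line 123: (x*y0*y0 + k4) * (y0 * k3)
}
```

With an exact `y0`, for instance `lines_119_to_123(4.0f, 0.5f, 1.0f, 0.0f)`, the function returns 0.5 under the question's reading of the constants (1 in ymm3, 0 in ymm4).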

These two redundancies suggest that whenever 1/x is required, the compiler multiplies the already obtained output by the original number to get back 1, and then multiplies this 1 by the already obtained output to get back the same output. Is there any reason for doing this? Or is it just a bug?

Solution

I think what you are seeing in the generated code is an additional iteration of Newton-Raphson to refine the initial estimate provided by vrcpps. (See: the Intel Intrinsics Guide for details of the accuracy of the initial estimate provided by vrcpps.)
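
To make the refinement concrete: one Newton-Raphson step for r = 1/e is r1 = r0*(2 - e*r0) = 2*r0 - e*r0*r0, which has the same shape as lines 107 to 112. One step for y = 1/sqrt(x) is y1 = 0.5*y0*(3 - x*y0*y0) = (-0.5*y0)*(x*y0*y0 - 3), which has the same shape as lines 119 to 123 if `.LC1` holds -0.5 and `.LC0` holds -3.0. Those constant values are an assumption, since the listing does not show them. The sketch below is scalar and uses a hypothetical helper `coarse` that only imitates a roughly 12-bit estimate by truncating mantissa bits; it is not how vrcpps or vrsqrtps compute their results.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware estimate: keep the sign, exponent and top
// 11 stored mantissa bits of an exact float, giving about 12 significant bits.
// Real vrcpps/vrsqrtps results differ; only the accuracy level is imitated.
float coarse(float exact) {
    std::uint32_t bits;
    std::memcpy(&bits, &exact, sizeof bits);
    bits &= 0xFFFFF000u;               // clear the low 12 of 23 mantissa bits
    std::memcpy(&exact, &bits, sizeof bits);
    return exact;
}

// One Newton-Raphson step for r = 1/e: r1 = r0 * (2 - e*r0) = 2*r0 - e*r0*r0.
float refine_rcp(float e, float r0) {
    return (r0 + r0) - e * r0 * r0;
}

// One Newton-Raphson step for y = 1/sqrt(x): y1 = 0.5 * y0 * (3 - x*y0*y0),
// written as (-0.5*y0) * (x*y0*y0 - 3). The constants -0.5 and -3.0 are assumed.
float refine_rsqrt(float x, float y0) {
    assert(x > 0.0f);
    return (-0.5f * y0) * (x * y0 * y0 - 3.0f);
}

// Relative error of an approximation against a double-precision reference.
double rel_err(double approx, double exact) {
    return std::fabs(approx - exact) / std::fabs(exact);
}
```

Calling `refine_rcp(3.0f, coarse(1.0f / 3.0f))` and comparing both values against 1.0/3.0 with `rel_err` shows the effect: the error of the refined value is roughly the square of the error of the coarse one, so the extra multiplications and the subtraction are what bring the estimate close to full single precision.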
