Issues of compiler generated assembly for intrinsics

Question

I'm using Intel SSE/AVX/FMA intrinsics to get perfectly inlined SSE/AVX instructions for some math functions.

Given the following code:

#include <cmath>
#include <immintrin.h>

auto std_fma(float x, float y, float z)
{
    return std::fma(x, y, z);
}

float _fma(float x, float y, float z)
{
    _mm_store_ss(&x,
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );

    return x;
}

float _sqrt(float x)
{
    _mm_store_ss(&x,
        _mm_sqrt_ss(_mm_load_ss(&x))
    );

    return x;
}

The clang 3.9 generated assembly with -march=x86-64 -mfma -O3:

std_fma(float, float, float):                          # @std_fma(float, float, float)
        vfmadd213ss     xmm0, xmm1, xmm2
        ret

_fma(float, float, float):                             # @_fma(float, float, float)
        vxorps  xmm3, xmm3, xmm3
        vmovss  xmm0, xmm3, xmm0        # xmm0 = xmm0[0],xmm3[1,2,3]
        vmovss  xmm1, xmm3, xmm1        # xmm1 = xmm1[0],xmm3[1,2,3]
        vmovss  xmm2, xmm3, xmm2        # xmm2 = xmm2[0],xmm3[1,2,3]
        vfmadd213ss     xmm0, xmm1, xmm2
        ret

_sqrt(float):                              # @_sqrt(float)
        vsqrtss xmm0, xmm0, xmm0
        ret

While the generated code for _sqrt is fine, there are unnecessary vxorps (which zeroes the completely unused xmm3 register) and vmovss instructions in _fma compared to std_fma (which relies on the compiler intrinsic std::fma).

The GCC 6.2 generated assembly with -march=x86-64 -mfma -O3:

std_fma(float, float, float):
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
_fma(float, float, float):
        vinsertps       xmm1, xmm1, xmm1, 0xe
        vinsertps       xmm2, xmm2, xmm2, 0xe
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
_sqrt(float):
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vsqrtss xmm0, xmm0, xmm0
        ret

Those are a lot of unnecessary vinsertps instructions.

Working example: https://godbolt.org/g/q1BQym

The default x64 calling convention passes floating-point function arguments in XMM registers, so those vmovss and vinsertps instructions should be eliminated. Why do the mentioned compilers still emit them? Is it possible to get rid of them without inline assembly?

I also tried to use _mm_cvtss_f32 instead of _mm_store_ss and multiple calling conventions, but nothing changed.

Answer

I'm writing this answer based on the comments, some discussion, and my own experience.

As Ross Ridge pointed out in the comments, the compiler is not smart enough to recognize that only the lowest floating-point element of the XMM register is used, so it zeroes out the other three elements with those vxorps/vinsertps instructions. This is absolutely unnecessary, but what can you do?

Note that clang 3.9 does a much better job than GCC 6.2 (or the current 7.0 snapshot) at generating assembly for Intel intrinsics, since it only fails at _mm_fmadd_ss in my example. I tested more intrinsics as well, and in most cases clang did a perfect job of emitting single instructions.

What you can do

You can use the standard <cmath> functions, hoping that they are defined as compiler intrinsics whenever a proper CPU instruction is available.

This is not enough, though:

Compilers like GCC implement these functions with special handling of NaN and infinities. So in addition to the intrinsic, they may add comparisons, branching, and possibly errno handling.

The compiler flags -fno-math-errno -fno-trapping-math do help GCC and clang eliminate the additional floating-point special cases and errno handling, so they can emit single instructions where possible: https://godbolt.org/g/LZJyaB.
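
For example, here is a minimal sketch (the file and function names are purely illustrative) that should compile down to single vfmadd213ss / vsqrtss instructions when built with the flags above, e.g. g++ -O3 -march=x86-64 -mfma -fno-math-errno -fno-trapping-math:

// fma_sqrt_demo.cpp -- hypothetical file name, for illustration only.
// With -fno-math-errno -fno-trapping-math, GCC and clang can usually lower
// these <cmath> calls to a single instruction, without errno or NaN branches.
#include <cmath>

float fma_via_cmath(float x, float y, float z)
{
    return std::fma(x, y, z);   // expected: vfmadd213ss (or vfmadd132ss)
}

float sqrt_via_cmath(float x)
{
    return std::sqrt(x);        // expected: vsqrtss
}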

You can achieve the same with -ffast-math, since it also includes the above flags, but it includes much more than that, and some of it (like unsafe math optimizations) is probably not desired.

Unfortunately this is not a portable solution. It works in most cases (see the godbolt link), but you still depend on the implementation.

What else you can do

You can still use inline assembly, which is also not portable, much more tricky, and leaves much more to consider. In spite of that, for such simple one-line instructions it can be okay.

Things to consider:

First, GCC/clang and Visual Studio use different syntax for inline assembly, and Visual Studio doesn't allow it at all in x64 mode.

Second, you need to emit VEX-encoded instructions (3-operand variants, e.g. vsqrtss xmm0, xmm1, xmm2) for AVX targets, and non-VEX-encoded variants (2-operand, e.g. sqrtss xmm0, xmm1) for pre-AVX CPUs. VEX-encoded instructions are 3-operand instructions, so they give the compiler more freedom to optimize. To take advantage of this, the register input/output parameters must be set up properly, so something like the following does the job.

#   if __AVX__
    asm("vsqrtss %1, %1, %0" :"=x"(x) : "x"(x));
#   else
    asm("sqrtss %1, %0" :"=x"(x) : "x"(x));
#   endif
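
For instance, a hypothetical wrapper built around that snippet could look like the sketch below (the name asm_sqrt is just for illustration; GCC/clang extended asm only):

// Hypothetical wrapper around the snippet above (GCC/clang extended asm).
float asm_sqrt(float x)
{
#if defined(__AVX__)
    // VEX form: a separate "=x" output and "x" input let the compiler pick
    // different registers, matching the 3-operand encoding.
    asm("vsqrtss %1, %1, %0" : "=x"(x) : "x"(x));
#else
    // Legacy SSE form: 2-operand encoding; the upper elements of the
    // destination register are left unchanged.
    asm("sqrtss %1, %0" : "=x"(x) : "x"(x));
#endif
    return x;
}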

But the following is a bad technique for VEX:

asm("vsqrtss %1, %1, %0" :"+x"(x));

It can yield an unnecessary move instruction; check https://godbolt.org/g/VtNMLL.

Third, as Peter Cordes pointed out, you can lose common subexpression elimination (CSE) and constant folding (constant propagation) with inline-assembly functions. However, if the inline asm is not declared volatile, the compiler can treat it as a pure function that depends only on its inputs and still perform common subexpression elimination, which is great.
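
As a hypothetical illustration (reusing the asm_sqrt sketch from above), a non-volatile asm statement with no side effects may be deduplicated by the compiler:

// Because the asm statement in asm_sqrt is not volatile and depends only on
// its input, the compiler is allowed to compute the square root just once
// here; constant inputs, however, will not be folded at compile time.
float sum_of_sqrts(float a)
{
    return asm_sqrt(a) + asm_sqrt(a);
}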

As Peter said:

"不要使用嵌入式asm " 是'绝对的规则,那只是你的事 使用前应了解并仔细考虑.如果 替代方案不符合您的要求,并且您最终不会获得 内联到无法优化的地方,然后继续

"Don't use inline asm" isn't an absolute rule, it's just something you should be aware of and consider carefully before using. If the alternatives don't meet your requirements, and you don't end up with this inlining into places where it can't optimize, then go right ahead.
