在内联asm中使用特定的zmm寄存器 [英] Using a specific zmm register in inline asm

查看:287
本文介绍了在内联asm中使用特定的zmm寄存器的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我可以告诉 gcc风格的内联程序集__m512i变量放入特定 zmm寄存器中,例如zmm31?

Can I tell gcc-style inline assembly to put my __m512i variable into a specific zmm register, like zmm31?

推荐答案

对于完全没有特定寄存器约束的目标(如ARM),请使用

Like on targets where the are no specific-register constraints at all (like ARM), use local register variables to get broad constraints to pick a specific register for asm statements. The compiler can still optimize otherwise, because the only documented guaranteed effect of a register-local is for asm inputs/outputs.

即使没有asm,编译器也会首选指定的寄存器. (因此,您可以编写看似有效的代码,但对于诸如register int ebx asm("ebx"); return ebx;之类的代码通常并不安全.GCC文档使行为得以保证/面向未来,即使当前的gcc更愿意使用指定的寄存器来浪费足够的时间也是如此.约束与指定寄存器不兼容时的说明,请参见下文.)

The compiler will prefer the specified register even if there's no asm, though. (So you can write code that appears to work but isn't safe in general with stuff like register int ebx asm("ebx"); return ebx;. GCC documentation is what makes a behaviour guaranteed / future-proof, even if current gcc prefers using the specified register strongly enough to waste instructions when the constraint isn't compatible with the specified register, see below.)

无论如何,这种使用注册自动取款机 local 变量是它们保证可以使用的 only :

Anyway, this use of register-asm local variables is the only thing they're guaranteed to work for:

#include <immintrin.h>
__m512i foo() {
    register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
    register __m512i z30 asm("zmm30");

    asm("vmovdqa64 %1, %0  # from inline asm"
        : "=v"(z30)
        : "v"(z31)
       );
    return z30;
}

On the Godbolt compiler explorer, compiles to this with clang6.0:

    # clang -O3 -march=skylake-avx512
    vbroadcastss    .LCPI0_0(%rip), %zmm31 # zmm31 = [1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43]
    vmovdqa64       %zmm31, %zmm30        # from inline asm
    vmovaps %zmm30, %zmm0
    retq

和gcc8.2:

# gcc -O3 -march=skylake-avx512
foo():
    movl    $123, %eax
    vpbroadcastd    %eax, %zmm31
    vmovdqa64 %zmm31, %zmm30  # from inline asm
    vmovdqa64       %zmm30, %zmm0
    ret


请注意"v"约束,该约束允许任何EVEX向量寄存器(0..31),而"x"仅允许前16个."x"被记录为任何SSE寄存器" ",但也适用于AVX YMM寄存器. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints. html .


Note the "v" constraints which allow any EVEX vector register (0..31), unlike "x" which only allows the first 16. "x" is documented as "any SSE register", but also applies to AVX YMM registers. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.

为此使用"x"不会导致任何警告,但是在gcc "x"胜过寄存器变量声明的情况下,因此选择了%zmm2和%zmm1(奇怪的是不是zmm0,所以需要多做一些动作是必需的).因此,register-asm声明确实使我们付出了效率.

Using "x" for this didn't result in any warnings, but with gcc "x" won vs. the register-variable declaration, so it chose %zmm2 and %zmm1 (strangely not zmm0 so an extra move was required). The register-asm declaration thus did cost us efficiency.

使用clang仍然使用zmm31和zmm30,显然违反了"x"约束,因此,如果您在寄存器操作数的XMM或YMM部分使用了没有EVEX版本的指令,它将无法汇编AVX2 vpcmpeqd ymm,ymm,ymm (比较向量,不要比较掩码). (在GNU C内联汇编中,单个操作数的xmm/ymm/zmm修饰符是什么?).

With clang it still used zmm31 and zmm30, apparently violating the "x" constraint, so it would have failed to assemble if you'd used an instruction with no EVEX version on the XMM or YMM part of the register operand, like AVX2 vpcmpeqd ymm,ymm,ymm (compare into vector, not compare into mask). (In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?).

//#ifndef __clang__
__m512i broken_with_clang() {
    register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
    register __m512i z30 asm("zmm30") = _mm512_setzero_si512();
    // notice that gcc still inits these in zmm31 and 30, *then* copies
    // so register asm costs us efficiency.

    // AVX512 only has compares into k registers, not into YMM registers.
    asm("vpcmpeqd %t1, %t0, %t0  # from inline asm. input was %0"
        : "+x"(z30)
        : "x"(z31)
       );
    return z30;
}
//#endif

使用clang时,每个操作数都会出错;我猜clang不支持t修饰符来获取寄存器的YMM名称(因为即使我完全删除了register ... asm()东西,它也无法在clang6.0中使用.)

With clang we get an error for each operand; I guess clang doesn't support t modifiers to get the YMM name of the register (because it fails with clang6.0 even if I remove the register ... asm() stuff entirely.)

<source>:21:9: error: invalid operand in inline asm: 'vpcmpeqd ${1:t}, ${0:t}, ${0:t}  # from inline asm. input was $0'
    asm("vpcmpeqd %t1, %t0, %t0  # from inline asm. input was %0"
        ^
...
<source>:21:9: error: unknown token in expression
<inline asm>:1:11: note: instantiated into assembly here
        vpcmpeqd , ,   # from inline asm. input was %zmm30

但是gcc编译就可以了:

But gcc compiles it just fine:

broken_with_clang():
    movl    $123, %eax
    vpbroadcastd    %eax, %zmm31
    vpxord  %xmm30, %xmm30, %xmm30

    vmovdqa64       %zmm30, %zmm1    # extra overhead because of register asm
    vmovdqa64       %zmm31, %zmm2    # which didn't match the constraints

    vpcmpeqd %ymm2, %ymm1, %ymm1  # from inline asm. input was %zmm1

    vmovdqa64       %zmm1, %zmm0     # extra overhead because gcc didn't pick zmm0
    ret

这篇关于在内联asm中使用特定的zmm寄存器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆