适用于AVX512掩码寄存器(k1 ... k7)的GNU C内联asm输入约束? [英] GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

查看:104
本文介绍了适用于AVX512掩码寄存器(k1 ... k7)的GNU C内联asm输入约束?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

AVX512为其算术命令引入了opmask功能.一个简单的示例: godbolt.org .

AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org.

#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "mov ebx, 0xAAAAAAAA;                                   \n\t"
        "kmovw k1, ebx;                                         \n\t"
        "vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B];  # conditional add   "
        :   [SUM]   "=v"(sum)
        :   [A]     "v" (a),
            [B]     "v" (b)
        : "ebx", "k1"  // clobbers
       );
    return sum;
}


-march=skylake-avx512 -masm=intel -O3


 mov ebx,0xaaaaaaaa
 kmovw k1,ebx
 vpaddd zmm0{k1}{z},zmm0,zmm1


问题是必须指定k1.


The problem is that k1 has to be specified.

是否存在像"r" 这样的整数输入约束,只是它选择了 k 寄存器而不是通用寄存器?

Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?

推荐答案

__ mmask16 实际上是 unsigned short 的typedef(以及其他普通整数类型的其他mask类型),因此我们只需要一个约束即可将其传递给 k 寄存器.

__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.

我们必须深入挖掘gcc来源

We have to go digging in the gcc sources config/i386/constraints.md to find it:

任何 掩码寄存器的约束为"k" .或对 k1..k7 (可以用作谓词,与 k0 不同)使用"Yk" .strong>例如,您将使用"= k" 操作数作为比对掩码的目标.

The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.

很明显,您可以将"= Yk"(tmp) __ mmask16 tmp 结合使用,以使编译器为您完成寄存器分配,而不仅仅是在您决定使用哪个"k" 注册.

Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.

首先, https://gcc.gnu.org/wiki/DontUseInlineAsm (如果可以避免的话).理解 asm很棒,但是可以使用它来读取编译器输出和/或找出最佳选择,然后编写可以编译所需方式的内在函数.性能调整信息,例如 https://agner.org/optimize/ https://software.intel.com/sites/landingpage/IntrinsicsGuide/上找到内在函数a>

First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/

内部函数还将使编译器将负载折叠到内存源操作数中,以用于其他指令;使用AVX512,甚至可以广播负载!内联汇编程序强制编译器使用单独的装入指令.即使输入"vm" ,也不会让编译器选择广播负载作为内存源,因为它不知道指令的广播元素宽度(s)您曾经在使用它.

Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.

使用 _mm512_mask_add_epi32 _mm512_maskz_add_epi32 ,尤其是如果您已经在使用 __ m512i 类型的<immintrin.h> .

此外,您的asm有一个错误:您使用的是 {k1} 合并掩码而不是 {k1} {z} 零掩码,但您使用未初始化的 __ m512i sum; 和仅输出"= v" 约束作为合并目标!作为独立功能,它恰巧合并到 a 中,因为调用约定的格式为ZMM0 =第一个输入=返回值寄存器.但是当内联到其他函数中时,您绝对不能假定 sum 将选择与 a 相同的寄存器.最好的选择是对"+ v"(a)使用读/写操作数,并以此作为目标和第一个来源.

Also, your asm has a bug: you're using {k1} merge-masking not {k1}{z} zero-masking, but you used uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand for "+v"(a) and use is as the destination and first source.

合并屏蔽仅对"+ v" 读/写操作数有意义.(或在具有多个指令的asm语句中,您已经编写了该指令输出一次,并希望将另一个结果合并到其中.)

Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)

内在的行为会阻止您犯此错误;合并掩码版本为合并目标提供了额外的输入.(asm目标操作数).

Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).

// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B];  # conditional add   "
        :   [SUM]   "=v"(sum)
        :   [A]     "v" (a),
            [B]     "v" (b),
            [mask]  "Yk" ((__mmask16)0xAAAA)
         // no clobbers needed, unlike your question which I fixed with an edit
       );
    return sum;
}

请注意,所有 {} 均以(

Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.

此代码最早可在4.9时使用gcc进行编译,但实际上不这样做,因为它不了解 -march = skylake-avx512 ,甚至没有针对Skylake或KNL进行调整的设置.使用最新的GCC来了解您的CPU,以获得最佳结果.

This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.

<强> Godbolt编译器资源管理器 :在

# gcc8.3 -O3 -march=skylake-avx512 or -march=knl  (and -masm=intel)
add(long long __vector, long long __vector):
        mov     eax, -21846
        kmovw   k1, eax         # compiler-generated
       # inline asm starts
        vpaddd zmm0 {k1}{z}, zmm0, zmm1;  # conditional add   
       # inline asm ends
        ret

-mavx512bw (由 -march = skylake-avx512 表示,但不是 knl )是必需的.Yk" 可以在 int 上使用.如果使用 -march = knl 进行编译,则整数文字需要转换为 __ mmask16 __ mask8 ,因为 unsigned int = __mask32不适用于蒙版.

-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mask8, because unsigned int = __mask32 isn't available for masks.

[mask]"Yk"(0xAAAA)需要AVX512BW,即使该常数确实适合16位,也只是因为裸整数文字始终具有 int 类型.( vpaddd zmm每个向量有16个元素,因此我将常数缩短为16位.)使用AVX512BW,您可以传递更宽的常数,也可以省略更小的常数.

[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.

  • gcc6和更高版本支持 -march = skylake-avx512 .使用它来设置调整并启用所有功能.优选gcc8或至少gcc7.如果您曾经在内联asm之外使用新的ISA扩展(例如AVX512),则较新的编译器会生成不那么笨拙的代码.
  • gcc5支持 -mavx512f -mavx512bw ,但不了解Skylake.
  • gcc4.9不支持 -mavx512bw .
  • gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
  • gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
  • gcc4.9 doesn't support -mavx512bw.

"Yk" 尚未记录在感谢Ross在 查看全文

登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆