适用于AVX512掩码寄存器(k1 ... k7)的GNU C内联asm输入约束? [英] GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?
问题描述
AVX512为其算术命令引入了opmask功能.一个简单的示例: godbolt.org .
AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org.
#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
__m512i sum;
asm(
"mov ebx, 0xAAAAAAAA; \n\t"
"kmovw k1, ebx; \n\t"
"vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b)
: "ebx", "k1" // clobbers
);
return sum;
}
-march=skylake-avx512 -masm=intel -O3
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
问题是必须指定k1.
The problem is that k1 has to be specified.
是否存在像"r"
这样的整数输入约束,只是它选择了 k
寄存器而不是通用寄存器?
Is there an input constraint like "r"
for integers except that it picks a k
register instead of a general-purpose register?
推荐答案
__ mmask16
实际上是 unsigned short
的typedef(以及其他普通整数类型的其他mask类型),因此我们只需要一个约束即可将其传递给 k
寄存器.
__mmask16
is literally a typedef for unsigned short
(and other mask types for other plain integer types), so we just need a constraint for passing it in a k
register.
We have to go digging in the gcc sources config/i386/constraints.md
to find it:
任何 掩码寄存器的约束为"k"
.或对 k1..k7
(可以用作谓词,与 k0
不同)使用"Yk"
.strong>例如,您将使用"= k"
操作数作为比对掩码的目标.
The constraint for any mask register is "k"
. Or use "Yk"
for k1..k7
(which can be used as a predicate, unlike k0
). You'd use an "=k"
operand as the destination for a compare-into-mask, for example.
很明显,您可以将"= Yk"(tmp)
与 __ mmask16 tmp
结合使用,以使编译器为您完成寄存器分配,而不仅仅是在您决定使用哪个"k"
注册.
Obviously you can use "=Yk"(tmp)
with a __mmask16 tmp
to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k"
registers you decide to use.
首先, https://gcc.gnu.org/wiki/DontUseInlineAsm (如果可以避免的话).理解 asm很棒,但是可以使用它来读取编译器输出和/或找出最佳选择,然后编写可以编译所需方式的内在函数.性能调整信息,例如 https://agner.org/optimize/和 https://software.intel.com/sites/landingpage/IntrinsicsGuide/上找到内在函数a>
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
内部函数还将使编译器将负载折叠到内存源操作数中,以用于其他指令;使用AVX512,甚至可以广播负载!内联汇编程序强制编译器使用单独的装入指令.即使输入"vm"
,也不会让编译器选择广播负载作为内存源,因为它不知道指令的广播元素宽度(s)您曾经在使用它.
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm"
input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
使用 _mm512_mask_add_epi32
或 _mm512_maskz_add_epi32
,尤其是如果您已经在使用 __ m512i
类型的<immintrin.h>
.
此外,您的asm有一个错误:您使用的是 {k1}
合并掩码而不是 {k1} {z}
零掩码,但您使用未初始化的 __ m512i sum;
和仅输出"= v"
约束作为合并目标!作为独立功能,它恰巧合并到 a
中,因为调用约定的格式为ZMM0 =第一个输入=返回值寄存器.但是当内联到其他函数中时,您绝对不能假定 sum
将选择与 a
相同的寄存器.最好的选择是对"+ v"(a)
使用读/写操作数,并以此作为目标和第一个来源.
Also, your asm has a bug: you're using {k1}
merge-masking not {k1}{z}
zero-masking, but you used uninitialized __m512i sum;
with an output-only "=v"
constraint as the merge destination! As a stand-alone function, it happens to merge into a
because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum
will pick the same register as a
. Your best bet is to use a read/write operand for "+v"(a)
and use is as the destination and first source.
合并屏蔽仅对"+ v"
读/写操作数有意义.(或在具有多个指令的asm语句中,您已经编写了该指令输出一次,并希望将另一个结果合并到其中.)
Merge-masking only makes sense with a "+v"
read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
内在的行为会阻止您犯此错误;合并掩码版本为合并目标提供了额外的输入.(asm目标操作数).
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
__m512i sum;
asm(
"vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b),
[mask] "Yk" ((__mmask16)0xAAAA)
// no clobbers needed, unlike your question which I fixed with an edit
);
return sum;
}
Note that all the {
and }
are escaped with %
(https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}
.
此代码最早可在4.9时使用gcc进行编译,但实际上不这样做,因为它不了解 -march = skylake-avx512
,甚至没有针对Skylake或KNL进行调整的设置.使用最新的GCC来了解您的CPU,以获得最佳结果.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512
, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw
(由 -march = skylake-avx512
表示,但不是 knl
)是必需的.Yk" 可以在 int
上使用.如果使用 -march = knl
进行编译,则整数文字需要转换为 __ mmask16
或 __ mask8
,因为 unsigned int = __mask32
不适用于蒙版.
-mavx512bw
(implied by -march=skylake-avx512
but not knl
) is required for "Yk"
to work on an int
. If you're compiling with -march=knl
, integer literals need a cast to __mmask16
or __mask8
, because unsigned int = __mask32
isn't available for masks.
[mask]"Yk"(0xAAAA)
需要AVX512BW,即使该常数确实适合16位,也只是因为裸整数文字始终具有 int
类型.( vpaddd
zmm每个向量有16个元素,因此我将常数缩短为16位.)使用AVX512BW,您可以传递更宽的常数,也可以省略更小的常数.
[mask] "Yk" (0xAAAA)
requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int
. (vpaddd
zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
- gcc6和更高版本支持
-march = skylake-avx512
.使用它来设置调整并启用所有功能.优选gcc8或至少gcc7.如果您曾经在内联asm之外使用新的ISA扩展(例如AVX512),则较新的编译器会生成不那么笨拙的代码. - gcc5支持
-mavx512f -mavx512bw
,但不了解Skylake. - gcc4.9不支持
-mavx512bw
.
- gcc6 and later support
-march=skylake-avx512
. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm. - gcc5 supports
-mavx512f -mavx512bw
but doesn't know about Skylake. - gcc4.9 doesn't support
-mavx512bw
.