Getting GCC to generate a PTEST instruction when using vector extensions
Problem description
When using the GCC vector extensions for C, how can I check that all the values in a vector are zero?
For instance:
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
          v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
          mem++)
        v &= *(mem);
    return mem;
}
SSE4.1 has the PTEST instruction (AVX adds the 256-bit VPTEST form), which can perform a test like the one used as the for condition, but the code generated by GCC just unpacks the vector and checks the elements one by one:
.L2:
vandps (%rax), %ymm1, %ymm1
vmovdqa %xmm1, %xmm0
addq $32, %rax
vmovd %xmm0, %edx
testl %edx, %edx
jne .L2
vpextrd $1, %xmm0, %edx
testl %edx, %edx
jne .L2
vpextrd $2, %xmm0, %edx
testl %edx, %edx
jne .L2
vpextrd $3, %xmm0, %edx
testl %edx, %edx
jne .L2
vextractf128 $0x1, %ymm1, %xmm0
vmovd %xmm0, %edx
testl %edx, %edx
jne .L2
vpextrd $1, %xmm0, %edx
testl %edx, %edx
jne .L2
vpextrd $2, %xmm0, %edx
testl %edx, %edx
jne .L2
vpextrd $3, %xmm0, %edx
testl %edx, %edx
jne .L2
vzeroupper
ret
Is there any way to get GCC to generate an efficient test for this without resorting to intrinsics?
Update: For reference, here is code using a non-portable GCC builtin for (V)PTEST:
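#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));
/* Four 64-bit elements, despite the "si" in the name; this is the operand type the builtin expects. */
typedef long long int v4si __attribute__ ((vector_size (32)));

const v8ui ones = { 1, 1, 1, 1, 1, 1, 1, 1 };

v8ui*
foo(v8ui *mem) {
    v8ui v;
    /* ptestz256(a, b) returns 1 when (a & b) is all zero */
    for ( v = ones;
          !__builtin_ia32_ptestz256((v4si)v,
                                    (v4si)ones);
          mem++)
        v &= *(mem);
    return mem;
}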
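The builtin above only tests the bits selected by ones, which is enough here because v starts out as all ones. As a rough sketch of the same idea spelled with the Intel intrinsic _mm256_testz_si256 from <immintrin.h> (which GCC and Clang both provide), the version below tests v against itself, so any nonzero bit counts. The names v8ui_all_zero_ptest and foo_ptest are hypothetical, and the code assumes an x86 target whose CPU supports AVX.

```c
#include <stdint.h>
#include <immintrin.h>

typedef uint32_t v8ui __attribute__((vector_size(32)));

/* Hypothetical helper, not from the original post: returns 1 when all
   256 bits of *v are zero. target("avx") enables VPTEST for this one
   function, so the file does not have to be built with -mavx. */
static __attribute__((target("avx"))) int v8ui_all_zero_ptest(const v8ui *v)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)v);
    /* VPTEST sets ZF when (x & x) == 0; the intrinsic returns ZF. */
    return _mm256_testz_si256(x, x);
}

/* Same loop as above, testing v against itself rather than against ones. */
__attribute__((target("avx"))) v8ui *foo_ptest(v8ui *mem)
{
    v8ui v = { 1, 1, 1, 1, 1, 1, 1, 1 };
    while (!v8ui_all_zero_ptest(&v)) {
        v &= *mem;
        mem++;
    }
    return mem;
}
```

This still uses an intrinsic, so it does not answer the question as asked; it only replaces the GCC-specific builtin with a spelling that other compilers accept.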
gcc 4.9.2 -O3 -mavx2 (in 64-bit mode) didn't realize it could use ptest for this, with either || or |.
The | version extracts the vector elements with vmovd and vpextrd, and combines them with seven or instructions between 32-bit registers. So it's pretty bad, and doesn't take advantage of any simplification that would still produce the same logical truth value.
The || version is just as bad: it does the same extract-an-element-at-a-time, but with a test / jne for every one.
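One simplification that keeps the same truth value is to OR wider chunks together before doing a single scalar test. The following is only a sketch under that assumption, using nothing beyond the vector extension and memcpy; the helper name v8ui_any_nonzero is hypothetical, and which instructions the compiler emits for it depends on the GCC version and flags.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t v8ui __attribute__((vector_size(32)));

/* Hypothetical helper, not from the original post: returns 1 if any
   element of *v is nonzero, 0 if all eight elements are zero. */
static inline int v8ui_any_nonzero(const v8ui *v)
{
    uint64_t lanes[4];
    /* memcpy reinterprets the 32 bytes without breaking strict aliasing. */
    memcpy(lanes, v, sizeof lanes);
    /* One OR chain has the same truth value as v[0] || ... || v[7]. */
    return ((lanes[0] | lanes[1]) | (lanes[2] | lanes[3])) != 0;
}

/* The loop from the question, rewritten around the helper. */
v8ui *foo(v8ui *mem)
{
    v8ui v = { 1, 1, 1, 1, 1, 1, 1, 1 };
    while (v8ui_any_nonzero(&v)) {
        v &= *mem;
        mem++;
    }
    return mem;
}
```

This replaces eight extract-and-test steps with three ORs and one comparison on 64-bit values, at the cost of moving the vector out of the SIMD register.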
So at this point, you can't count on GCC recognizing tests like this and doing anything remotely efficient. (pcmpeq / movmsk / test is another sequence that wouldn't be bad, but gcc doesn't generate that either.)
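For illustration, the pcmpeq / movmsk / test sequence can be written by hand with AVX2 intrinsics from <immintrin.h>. This is a sketch, not something GCC derives from the vector-extension code; the helper name v8ui_all_zero_movmsk is hypothetical, and the code assumes an x86 target whose CPU supports AVX2.

```c
#include <stdint.h>
#include <immintrin.h>

typedef uint32_t v8ui __attribute__((vector_size(32)));

/* Hypothetical helper, not from the original post: returns 1 when all
   eight elements of *v are zero. target("avx2") enables the AVX2
   instructions for this function without building the file with -mavx2. */
static __attribute__((target("avx2"))) int v8ui_all_zero_movmsk(const v8ui *v)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)v);
    /* vpcmpeqd: each 32-bit lane becomes all ones if it was zero. */
    __m256i eq = _mm256_cmpeq_epi32(x, _mm256_setzero_si256());
    /* vpmovmskb: collect the top bit of each of the 32 bytes. */
    unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
    /* All 32 mask bits set means every lane compared equal to zero. */
    return mask == 0xFFFFFFFFu;
}
```

Compared with a single VPTEST this takes an extra compare and a move to a general-purpose register, but the resulting mask also tells you which lanes were nonzero.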