Getting GCC to generate a PTEST instruction when using vector extensions


Problem description


When using the GCC vector extensions for C, how can I check that all the values on a vector are zero?

For instance:

#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
          v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
          mem++)
        v &= *(mem);

    return mem;
}

SSE4.1 has the PTEST instruction (AVX adds the 256-bit VPTEST form), which can perform a test like the one used as the for condition, but the code generated by GCC just unpacks the vector and checks the elements one by one:

.L2:
        vandps  (%rax), %ymm1, %ymm1
        vmovdqa %xmm1, %xmm0
        addq    $32, %rax
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vextractf128    $0x1, %ymm1, %xmm0
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vzeroupper
        ret

Is there any way to get GCC to generate an efficient test for that without resorting to intrinsics?

Update: For reference, code using an unportable GCC builtin for (V)PTEST:

#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));
typedef long long int v4si __attribute__ ((vector_size (32)));

const v8ui ones = { 1, 1, 1, 1, 1, 1, 1, 1 };

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = ones;
          !__builtin_ia32_ptestz256((v4si)v,
                                    (v4si)ones);
          mem++)
        v &= *(mem);

    return mem;
}
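
The loop condition above relies on what the builtin reports: ptestz returns 1 when the bitwise AND of its two operands has no bit set, and 0 otherwise. The following is a minimal portable sketch of that behaviour using only the vector extensions. `ptestz_model` is a hypothetical helper name introduced here for illustration; it is not a GCC builtin and is not expected to compile to PTEST.

```c
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

/* Hypothetical helper, not a GCC builtin: models the value that
   ptestz reports, i.e. 1 when (a & b) has no bit set, 0 otherwise. */
int
ptestz_model(const v8ui *a, const v8ui *b) {
    v8ui masked = *a & *b;          /* element-wise AND of the operands */
    uint32_t any = 0;
    for (int i = 0; i < 8; i++)     /* fold all eight lanes into a scalar */
        any |= masked[i];
    return any == 0;
}
```

With `ones` as the second operand only bit 0 of each element is examined. That is sufficient in `foo` because `v` starts as `ones` and is only ever AND-ed, so no other bit can become set.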

Solution

GCC 4.9.2 with -O3 -mavx2 (in 64-bit mode) didn't realize it could use ptest for this, with either || or | in the loop condition.

The | version extracts the vector elements with vmovd and vpextrd, then combines them with seven or instructions between 32-bit registers. So it's pretty bad, and doesn't take advantage of any simplification that would still produce the same logical truth value.

The || version is just as bad: it uses the same one-element-at-a-time extraction, but does a test / jne for each element.
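
For reference, the two source-level forms being compared can be written as the predicates below. This is only a sketch: the function names are hypothetical, and the comments summarize the code generation reported above for GCC 4.9.2, which other compiler versions may not reproduce.

```c
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

/* "||" form: short-circuit evaluation, so each lane gets its own
   test and conditional branch. */
int
any_nonzero_logical(const v8ui *p) {
    v8ui v = *p;
    return v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
}

/* "|" form: no short-circuit; the eight lanes are combined with
   seven ORs and the result is tested once. */
int
any_nonzero_bitwise(const v8ui *p) {
    v8ui v = *p;
    return (v[0] | v[1] | v[2] | v[3] | v[4] | v[5] | v[6] | v[7]) != 0;
}
```

Both functions return the same truth value for every input; they differ only in how the compiler is invited to evaluate it.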

So at this point, you can't count on GCC recognizing tests like this and doing anything remotely efficient. (pcmpeq / movmsk / test is another sequence that wouldn't be bad, but gcc doesn't generate that either.)
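
To make the pcmpeq / movmsk sequence concrete, here is a hedged sketch written with SSE2 intrinsics. It does not satisfy the "no intrinsics" goal of the question; it only illustrates the kind of sequence meant. It assumes an x86 target with SSE2 (the x86-64 baseline), treats the 256-bit vector as two 128-bit halves, and uses a hypothetical helper name `all_zero_sse2`. The final comparison against 0xFFFF stands in for the test step, and the instructions actually emitted remain up to the compiler.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

/* Hypothetical helper: returns 1 when all eight 32-bit lanes are zero. */
int
all_zero_sse2(const v8ui *p) {
    /* Load the 256-bit vector as two 128-bit halves. */
    __m128i lo = _mm_loadu_si128((const __m128i *)p);
    __m128i hi = _mm_loadu_si128((const __m128i *)p + 1);

    /* OR the halves: a lane is zero only if it is zero in both. */
    __m128i both = _mm_or_si128(lo, hi);

    /* pcmpeqd: each lane becomes all-ones where it equals zero. */
    __m128i eq = _mm_cmpeq_epi32(both, _mm_setzero_si128());

    /* pmovmskb: collect one bit per byte; 0xFFFF means every byte matched. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

The loop in the question needs the opposite sense (continue while any lane is nonzero), so its condition would be `!all_zero_sse2(&v)`.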
