Getting GCC to generate a PTEST instruction when using vector extensions

Views: 271  Posted: 2018/4/20 17:40:45  Tags: c gcc vectorization sse avx2

Question

When using the GCC vector extensions for C, how can I check that all the values in a vector are zero?

For instance:

    #include <stdint.h>

    typedef uint32_t v8ui __attribute__ ((vector_size (32)));

    v8ui*
    foo(v8ui *mem) {
        v8ui v;
        for ( v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
              v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
              mem++)
            v &= *(mem);

        return mem;
    }

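SSE4.1 has the PTEST instruction (AVX adds the 256-bit VPTEST form), which can run a test like the one used as the for condition, but the code generated by GCC just unpacks the vector and checks the elements one by one: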

    .L2:
        vandps  (%rax), %ymm1, %ymm1
        vmovdqa %xmm1, %xmm0
        addq    $32, %rax
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vextractf128 $0x1, %ymm1, %xmm0
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vzeroupper
        ret

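Is there any way to get GCC to generate an efficient test for that without resorting to intrinsics?

Update: For reference, code using an unportable GCC builtin for (V)PTEST: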

    typedef uint32_t v8ui __attribute__ ((vector_size (32)));
    typedef long long int v4si __attribute__ ((vector_size (32)));

    const v8ui ones = { 1, 1, 1, 1, 1, 1, 1, 1 };

    v8ui*
    foo(v8ui *mem) {
        v8ui v;
        for ( v = ones;
              !__builtin_ia32_ptestz256((v4si)v,
                                        (v4si)ones);
              mem++)
            v &= *(mem);

        return mem;
    }
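
For comparison, the same loop can be spelled with the _mm256_testz_si256 intrinsic from <immintrin.h> rather than the GCC-only builtin. This is only a sketch under stated assumptions: the helper name all_zero, the target("avx") attributes (used so the file builds without -mavx) and the added assert are illustrative and not part of the original code. It still relies on an intrinsic, so it does not answer the question as asked.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

/* Hypothetical helper: returns 1 when every bit of *v is zero.
   VPTEST sets ZF when (a AND b) == 0, so testing v against itself
   checks v == 0. */
static __attribute__ ((target ("avx"))) int
all_zero(const v8ui *v)
{
    /* v8ui is 32-byte aligned, so the aligned load is safe. */
    __m256i x = _mm256_load_si256((const __m256i *)v);
    return _mm256_testz_si256(x, x);
}

/* Same loop as above: AND vectors from mem into v until v is all zero. */
__attribute__ ((target ("avx"))) v8ui *
foo(v8ui *mem)
{
    v8ui v = { 1, 1, 1, 1, 1, 1, 1, 1 };

    assert(mem != NULL);
    while (!all_zero(&v)) {
        v &= *mem;
        mem++;
    }
    return mem;
}
```

Unlike the builtin version, which masks against ones, this tests every bit of v. The two agree here because v only ever holds a subset of the bits in ones. The code needs a CPU with AVX at run time.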

Solution

gcc 4.9.2 -O3 -mavx2 (in 64-bit mode) didn't realize it could use ptest for this, with either || or |.

The | version extracts the vector elements with vmovd and vpextrd, and combines them with 7 or instructions between 32-bit registers. So it's pretty bad, and doesn't take advantage of any simplifications that would still produce the same logical truth value.

The || version is just as bad: it does the same extract-an-element-at-a-time, but does a test / jne for every one.

So at this point, you can't count on GCC recognizing tests like this and doing anything remotely efficient. (pcmpeq / movmsk / test is another sequence that wouldn't be bad, but gcc doesn't generate that either.)
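
One simplification that keeps the same truth value can be written by hand using only the vector extensions: view the 256 bits as four 64-bit lanes and OR them together, so there are four extracts and one test instead of eight. This is a rough sketch, not a way to obtain PTEST. The type v4ul, the helper any_set and the added assert are hypothetical names introduced here, and the code GCC actually emits for it has not been checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));
typedef uint64_t v4ul __attribute__ ((vector_size (32)));  /* hypothetical helper type */

/* Hypothetical helper: returns 1 if any bit of *v is set. */
static int
any_set(const v8ui *v)
{
    v4ul q = (v4ul)*v;   /* same 256 bits viewed as 4 x 64-bit lanes */
    return (q[0] | q[1] | q[2] | q[3]) != 0;
}

/* Same loop as in the question: AND vectors from mem into v until
   v becomes all zero, then return the advanced pointer. */
v8ui *
foo(v8ui *mem)
{
    v8ui v = { 1, 1, 1, 1, 1, 1, 1, 1 };

    assert(mem != NULL);
    while (any_set(&v)) {
        v &= *mem;
        mem++;
    }
    return mem;
}
```

The cast between the two vector types only reinterprets the bits, so the result is nonzero exactly when one of the original eight elements is nonzero. Passing the vector by pointer avoids ABI warnings when the file is built without AVX enabled.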
