Compare two 16-byte values for equality using up to SSE 4.2?


Problem description



I have a struct like this:

    struct {
        uint32_t a;
        uint16_t b;
        uint16_t c;
        uint16_t d;
        uint8_t  e;
    } s;

and I would like to compare two of the above structs for equality in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find an integer compare; the options available were mainly for double- and single-precision floating-point vector inputs.

Could somebody please advise on the best approach? I can add a union to my struct to make processing easier.

I am limited (for now) to SSE4.2, but any AVX answers would be welcome too if they are significantly faster. I am using GCC 4.8.2.

Solution

What @zx485 should have written is:

    .data
      mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
    .code
      pxor  xmm1, xmm2                       ; equivalent to psubb for this purpose, but runs on all 3 vector execution ports
      ptest xmm1, xmmword ptr [mask11byte]   ; SSE 4.1: ZF = ((xmm1 AND mask) == 0)
      setz  al                               ; AL = TRUE if equal

As long as nothing bad can happen (e.g. floating-point exceptions), you don't need to mask off your operands before the computation, even if the unused bytes hold garbage. And since PTEST does a bitwise AND as part of its operation, you don't need a separate PAND at all.
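Since the question asks about intrinsics with GCC, the pxor/ptest sequence above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the answer's own code: the names `union s16` and `s_equal_ptest` are hypothetical, the union follows the asker's suggestion of padding the struct so that a full 16-byte load is safe, and it assumes the usual x86 layout in which the fields occupy the first 11 bytes. The `target` attribute lets recent GCC and Clang build it without extra flags; with GCC 4.8 you would compile with `-msse4.1` instead.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 (also pulls in the SSE2 intrinsics) */

struct s {
    uint32_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
    uint8_t  e;
};

/* Hypothetical wrapper: pads the struct to 16 bytes so a vector load never
   reads past the end of the object. */
union s16 {
    struct s s;
    uint8_t  raw[16];
};

/* Returns 1 if the first 11 bytes (fields a..e) are equal, else 0. */
__attribute__((target("sse4.1")))
static int s_equal_ptest(const union s16 *x, const union s16 *y)
{
    /* 0xFF for the 11 meaningful bytes, 0 for the padding/garbage bytes. */
    const __m128i mask = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                       -1, -1, -1, 0, 0, 0, 0, 0);
    __m128i vx   = _mm_loadu_si128((const __m128i *)x->raw);
    __m128i vy   = _mm_loadu_si128((const __m128i *)y->raw);
    __m128i diff = _mm_xor_si128(vx, vy);   /* pxor: nonzero bits mark differences */
    return _mm_testz_si128(diff, mask);     /* ptest: 1 iff (diff AND mask) == 0 */
}
```

Whether the compiler emits exactly the instruction sequence shown above depends on the compiler version and optimization level, so it is worth checking the generated assembly.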

For a while, I thought I had a version that could use less space and fewer uops, but I ended up needing an extra instruction because there's no pcmpneq (so I needed a logical NOT). So it's smaller and has the same number of uops, but significantly worse latency.

    .code
      PCMPEQB  xmm1, xmm2   ; bytes of xmm1 = 0xFF where equal
      PMOVMSKB eax, xmm1    ; eax = high bit of each byte of xmm1
      NOT      eax
      TEST     eax, 0x7FF   ; zero flag set if all the low 11 bits are zero
      SETZ     al           ; 17 bytes

    ; Or one fewer insn with BMI1's ANDN, replacing the NOT + TEST pair.
    ; One fewer uop if test can't macro-fuse.
      ANDN eax, eax, [mask11bits]   ; eax = NOT(eax) AND mask: only test the low 11 bits
    ; ANDN version takes 20 bytes, plus 4B of data
    .data
      mask11bits dd 07ffh   ; dword, because the 32-bit ANDN loads 4 bytes

test can macro-fuse with jcc, so if you're using this as a jump condition instead of actually doing setz, you come out ahead on size (since you don't need the 16B mask constant).
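For comparison, the pcmpeqb/pmovmskb variant needs only SSE2 and might look like this with intrinsics. Again this is a sketch under assumptions: `union s16` and `s_equal_movemask` are hypothetical names, and the struct is assumed to be padded to 16 bytes with the fields in the first 11 bytes. In C there is no need to spell out the NOT; comparing the masked bits against 0x7FF expresses the same condition, and the compiler chooses the actual instructions.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

struct s {
    uint32_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
    uint8_t  e;
};

/* Hypothetical wrapper: pads the struct to 16 bytes so a vector load is safe. */
union s16 {
    struct s s;
    uint8_t  raw[16];
};

/* Returns 1 if the first 11 bytes (fields a..e) are equal, else 0. */
static int s_equal_movemask(const union s16 *x, const union s16 *y)
{
    __m128i  vx   = _mm_loadu_si128((const __m128i *)x->raw);
    __m128i  vy   = _mm_loadu_si128((const __m128i *)y->raw);
    __m128i  eq   = _mm_cmpeq_epi8(vx, vy);            /* pcmpeqb: 0xFF in each equal byte */
    unsigned bits = (unsigned)_mm_movemask_epi8(eq);   /* pmovmskb: one bit per byte */
    return (bits & 0x7FFu) == 0x7FFu;                  /* all 11 low bits set means equal */
}
```

When the result feeds an `if` directly, the compiler is free to turn the final comparison into a compare-and-branch rather than materializing the 0/1 value.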

ptest takes 2 uops, so the ptest version is 4 uops total (including the jcc or other instruction). The pmovmskb version is also 4 uops with a test/jcc macro-fused branch, but 5 with cmovcc / setcc. (4 with andn, with any of setcc / cmovcc / jcc, since it can't macro-fuse.)

(Agner Fog's table says ptest takes 1 fused-domain uop on Sandybridge, 2 on all other Intel CPUs that support it. I'm not sure I believe that, though.)

Latency on Haswell (important if the branch doesn't predict well):

• pxor: 1 + ptest: 2 = 3 cycles
• pcmpeqb: 1 + pmovmskb: 3 + not: 1 + test: 1 = 6 cycles
• pcmpeqb: 1 + pmovmskb: 3 + andn: 1 = 5 cycles (but not macro-fused, so possibly 1 more cycle of latency?)

So the ptest version has significantly shorter latency: jcc can execute sooner, so branch mispredicts are detected sooner.

Agner Fog's tests show ptest has latency = 3 on Nehalem, 1 on SnB/IvB, 2 on Haswell.
