使用SSE/AVX/AVX2检查__m128i的所有字节是否匹配单个字节 [英] Check all bytes of a __m128i for a match of a single byte using SSE/AVX/AVX2

查看:45
本文介绍了使用SSE/AVX/AVX2检查__m128i的所有字节是否匹配单个字节的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找有效的方法来计算以下函数:

I'm looking for efficient ways to compute the following function:

输入: __ m128i数据,uint8_t输入

输出:布尔值,指示 data 中的任何字节是否在 in 中.

Output: boolean indicating if any byte in data is in.

我实际上是在使用它们为容量为8的字节实现时空有效的堆栈.我最有效的解决方案是首先计算一个 __ m128i tmp ,所有字节均作为 in .然后检查 tmp \ xor数据中的任何字节是否为零字节.

I'm essentially using them to implement a space/time efficient stack for bytes with capacity 8. My most efficient solution is to first compute a __m128i tmp with all bytes as in. Then check if any bytes in tmp\xor data is a zero byte.

推荐答案

是的,AVX2具有高效的字节广播功能.具有全零掩码的SSSE3 pshufb 同样便宜,但是您必须创建随机控制向量.AVX512BW/F甚至具有单指令 <代码> vpbroadcastb/w/d/qx/y/zmm,r32 .(使用可选的遮罩,因此您可以根据需要将一些零或与现有矢量合并,例如,使用一位掩码将其插入某个位置.)

Yes, AVX2 has efficient byte-broadcast. SSSE3 pshufb with an all-zero mask is just as cheap, but you have to create the shuffle-control vector. AVX512BW/F even has single-instruction vpbroadcastb/w/d/q x/y/zmm, r32. (With optional masking, so you can zero some or merge with an existing vector if you want, e.g. insert at a position using a single-bit mask.)

幸运的是,编译器在实现 _mm_set1_epi8 时知道如何执行此操作,因此我们可以将其留给编译器.

Fortunately, compilers know how to do this when implementing _mm_set1_epi8 so we can leave it to the compiler.

然后将其简化为通常的 pcmpeqb / pmovmskb 以获得一个整数,该整数将具有一个 1 位用于匹配元素,您可以分支.

Then it just boils down to the usual pcmpeqb/pmovmskb to get an integer which will have a 1 bit for matching elements, which you can branch on.

// 0 for not found, non-zero for found.  (Bit position tells you where).
unsigned contains(__m128i data, uint8_t needle) {
    __m128i k = _mm_set1_epi8(needle);
    __m128i cmp = _mm_cmpeq_epi8(data, k);  // vector mask
    return _mm_movemask_epi8(cmp);          // integer bitmask 
}

正如你所期望,所有的编译器让这款采用该ASM(的 Godbolt )

contains(long long __vector(2), unsigned char):
    vmovd   xmm1, edi
    vpbroadcastb    xmm1, xmm1
    vpcmpeqb        xmm0, xmm0, xmm1
    vpmovmskb       eax, xmm0
    ret

除了MSVC,MSVC首先浪费了 movsx eax,dl 上的一条指令.(Windows x64在RDX中传递了第二个arg,而x86-64 System V在RDI中传递了第一个 integer arg.)

Except MSVC, which wastes an instruction on movsx eax, dl first. (Windows x64 passes the 2nd arg in RDX, vs. x86-64 System V passing the first integer arg in RDI.)

如果没有AVX2,则使用SSSE3或更高版本会得到类似的东西

Without AVX2, you'll get something like this with SSSE3 or above

# gcc8.3 -O3 -march=nehalem
contains(long long __vector(2), unsigned char):
    movd    xmm1, edi

    pxor    xmm2, xmm2
    pshufb  xmm1, xmm2         # _mm_shuffle_epi8(needle, _mm_setzero_si128())

    pcmpeqb xmm0, xmm1
    pmovmskb        eax, xmm0
    ret

或仅使用SSE2(x86-64的基准):

Or with just SSE2 (baseline for x86-64):

contains(long long __vector(2), unsigned char):
    mov     DWORD PTR [rsp-12], edi
    movd    xmm1, DWORD PTR [rsp-12]    # gcc's tune=generic strategy is still store/reload  /facepalm
    punpcklbw       xmm1, xmm1          # duplicate to low 2 bytes
    punpcklwd       xmm1, xmm1          # duplciate to low 4 bytes
    pshufd  xmm1, xmm1, 0               # broadcast

    pcmpeqb xmm1, xmm0
    pmovmskb        eax, xmm1
    ret


相关:

SIMD/SSE:如何检查所有向量元素是否均为非零( pxor + ptest + jcc = 4 uos vs. pcmpeqb + pmovmskb +宏融合的 test/jcc = 3 oups.)

SIMD/SSE: How to check that all vector elements are non-zero (pxor+ptest+jcc = 4 uops vs. pcmpeqb+pmovmskb + macro-fused test/jcc = 3 uops.)

非索引-SSE/AVX寄存器的零字节(查找匹配位置)

如何使用SIMD计算字符出现次数(像memchr一样,但是使用AVX2进行匹配计数而不是找到第一个匹配项.有效地积累了计数,并且有效地实现了水平和.)

How to count character occurrences using SIMD (like memchr but counting matches instead of finding the first, using AVX2. With efficient accumulation of the counts, and efficient horizontal sums.)

这篇关于使用SSE/AVX/AVX2检查__m128i的所有字节是否匹配单个字节的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆