将 16 字节字符串与 SSE 进行比较 [英] Compare 16 byte strings with SSE

查看:36
本文介绍了将 16 字节字符串与 SSE 进行比较的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有 16 字节的字符串"(它们可能更短,但您可能会假设它们在末尾用零填充),但您可能不会假设它们是 16 字节对齐的(至少并非总是如此).

I have 16 byte 'strings' (they may be shorter but you may assume that they are padded with zeros at the end), but you may not assume they are 16 byte aligned (at least not always).

如何编写一个例程将它们(相等)与 SSE 内在函数进行比较?我发现此代码片段可能会有所帮助,但我不确定它是否合适?

How to write a routine that will compare them (for equality) with SSE intrinsics? I found this code fragment that could be of help but I', not sure if it is appropriate?

register __m128i xmm0, xmm1; 
register unsigned int eax; 

xmm0 = _mm_load_epi128((__m128i*)(a)); 
xmm1 = _mm_load_epi128((__m128i*)(b)); 

xmm0 = _mm_cmpeq_epi8(xmm0, xmm1); 

eax = _mm_movemask_epi8(xmm0); 

if(eax==0xffff) //equal 
else   //not equal 

有人能解释一下这个或写一个函数体吗?

Could someone explain this or write a function body?

它需要在 GCC/mingw 中工作(在 32 位 Windows 上).

It needs to work in GCC/mingw (on 32 bit Windows).

推荐答案

矢量比较指令将其结果作为 掩码,由全 1(真)或全 0(假)的元素组成) 根据对应源元素之间的比较.

Vector comparison instructions produce their result as a mask, of elements that are all-1s (true) or all-0s (false) according to the comparison between the corresponding source elements.

请参阅 https://stackoverflow.com/tags/x86/info 以获取一些链接,这些链接会告诉您那些内在函数.

See https://stackoverflow.com/tags/x86/info for some links that will tell you what those intrinsics do.

问题中的代码看起来应该可以工作.

如果您想找出哪些元素不相等,请使用 movemask 版本(pmovmskbmovmskps).你可以 tzcnt/bsf 对第一个匹配进行位扫描,或者 popcnt 来计算匹配.All-equal 给你一个 0xffff 掩码,non-equal 给你 0.

If you want to find out which elements were non-equal, then use the movemask version (pmovmskb or movmskps). You can tzcnt / bsf to bit-scan for the first match, or popcnt to count matches. All-equal gives you a 0xffff mask, non-equal gives you 0.

您可能想知道 SSE4.1 ptest 在这里是否有用.它可用,但实际上并没有更快,特别是如果您对结果进行分支而不是将其转换为布尔值 0/-1.

You might wonder if SSE4.1 ptest is useful here. It's usable but it's not actually faster, especially if you're branching on the result instead of turning it into a boolean 0 / -1.

 // slower alternative using SSE4.1 ptest
__m128i avec, bvec;
avec = _mm_loadu_si128((__m128i*)(a)); 
bvec = _mm_loadu_si128((__m128i*)(b)); 

__m128i diff = _mm_xor_si128(avec, bvec);  // XOR: all zero only if *a==*b

if(_mm_test_all_zeros(diff, diff))  { //equal 
} else {   //not equal 
}

在 asm 中,ptest 是 2 uops,并且不能与 jcc 条件分支进行宏融合.因此,前端的总 pxor + ptest + 分支是 4 uops,并且仍然会破坏其中一个输入,除非您有 AVX 将 xor 结果放在第三个注册.

In asm, ptest is 2 uops, and can't macro-fuse with a jcc conditional branch. So the total pxor + ptest + branch is 4 uops for the front-end, and still destroys one of the inputs unless you have AVX to put the xor result in a 3rd register.

pcmpeqb xmm0, xmm1/pmovmskb eax, xmm0/cmp/jcc 共3个uop,cmp/jcc融合1 uop 在 Intel 和 AMD CPU 上.

pcmpeqb xmm0, xmm1 / pmovmskb eax, xmm0 / cmp/jcc is a total of 3 uops, with the cmp/jcc fusing into 1 uop on Intel and AMD CPUs.

如果你有更宽的元素,你可以在 pcmpeqd/q 的结果上使用 movmskpsmovmskpd 以获得 4 位或2 位掩码.如果您想在不为每个元素除以 4 或 8 个字节的情况下进行位扫描或 popcnt,这将非常有用.(或者使用 AVX2,8 位或 4 位而不是 32 位掩码.)

If you have wider elements, you can use movmskps or movmskpd on the result of pcmpeqd/q to get a 4-bit or 2-bit mask. This is most useful if you want to bit-scan or popcnt without dividing by 4 or 8 bytes per element. (Or with AVX2, 8-bit or 4-bit instead of 32-bit mask.)

ptest 仅在您不需要任何额外指令来为其构建输入时才是一个好主意:测试是否为全零,带或不带掩码.例如检查每个元素或某些元素中的某些位.

ptest is only a good idea if you don't need any extra instruction to build an input for it: test for all-zeros or not, with or without a mask. e.g. to check some bits in every element, or in some elements.

这篇关于将 16 字节字符串与 SSE 进行比较的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆