Compare 16 byte strings with SSE


Problem description

I have 16 byte 'strings' (they may be shorter but you may assume that they are padded with zeros at the end), but you may not assume they are 16 byte aligned (at least not always).

How can I write a routine that compares them (for equality) with SSE intrinsics? I found this code fragment that could be of help, but I'm not sure if it is appropriate:

register __m128i xmm0, xmm1; 
register unsigned int eax; 

xmm0 = _mm_load_epi128((__m128i*)(a)); 
xmm1 = _mm_load_epi128((__m128i*)(b)); 

xmm0 = _mm_cmpeq_epi8(xmm0, xmm1); 

eax = _mm_movemask_epi8(xmm0); 

if(eax==0xffff) //equal 
else   //not equal 

Could someone explain this or write a function body?

It needs to work in GCC/mingw (on 32 bit Windows).

Solution

Vector comparison instructions produce their result as a mask: each element of the result is all-1s (true) or all-0s (false), according to the comparison between the corresponding source elements.

See https://stackoverflow.com/tags/x86/info for some links that will tell you what those intrinsics do.

Your code looks like it should work.
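As a rough sketch of the function body the question asks for, the pcmp / movemask approach could be wrapped up as below. The name equal16 is hypothetical. The sketch uses _mm_loadu_si128 (an unaligned load) instead of the _mm_load_epi128 name from the question's fragment, which does not appear to be a standard intrinsic, and because the strings are not guaranteed to be 16-byte aligned. It needs only SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if the 16 bytes at a and b are
 * identical, 0 otherwise. Both pointers must address at least 16
 * readable bytes; no alignment is required. */
static int equal16(const void *a, const void *b)
{
    /* Unaligned loads, since the buffers may not be 16-byte aligned. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Each byte of eq is 0xFF where the inputs match, 0x00 where they differ. */
    __m128i eq = _mm_cmpeq_epi8(va, vb);

    /* movemask gathers the top bit of each byte into a 16-bit integer,
     * so 0xffff means all 16 bytes compared equal. */
    return _mm_movemask_epi8(eq) == 0xffff;
}
```

Because the strings are zero-padded to 16 bytes, comparing the full 16 bytes gives the same answer as a string comparison: two char[16] buffers holding "hello" compare equal, and a buffer holding "hellp" does not.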

With SSE4.1 (for ptest) I might try:

__m128i avec, bvec;

avec = _mm_loadu_si128((const __m128i*)(a));
bvec = _mm_loadu_si128((const __m128i*)(b));

avec = _mm_xor_si128(avec, bvec);  // XOR: all zero only if *a==*b

if (_mm_test_all_zeros(avec, avec)) {
    // equal
} else {
    // not equal
}

Using ptest makes only a tiny difference in speed and code size compared to pcmp / movemask. In this case, ptest is actually slower: Stgatilov tested it. ptest is probably faster only when you don't need an extra instruction (here, the XOR) to build an input for it, that is, when you are testing a vector for all-zeros or not, with or without a mask. Its other result, the carry flag computed from the negated 1st arg, is rarely useful.

Also, if you want to find out which elements were non-equal, use the movemask version. If the mask is not 0xffff, you can apply lzcnt, popcnt, or whatever other bit-count operations you need to it.
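To illustrate working with the mask, here is a minimal sketch that returns the index of the first differing byte. The name first_mismatch16 is hypothetical, and the sketch assumes GCC or Clang, since it uses the __builtin_ctz builtin (a count-trailing-zeros operation, used here in place of the lzcnt / popcnt operations named above).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: index (0..15) of the first byte that differs
 * between the 16-byte buffers a and b, or -1 if all 16 bytes match. */
static int first_mismatch16(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Bit i of mask is 1 where byte i of a equals byte i of b. */
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
    if (mask == 0xffffu)
        return -1;

    /* Invert within the low 16 bits: 1 bits now mark differing bytes.
     * The lowest set bit corresponds to the lowest-addressed mismatch. */
    unsigned diff = ~mask & 0xffffu;
    return __builtin_ctz(diff);  /* GCC/Clang builtin (assumption) */
}
```

Applying __builtin_popcount to the same diff value would instead give the number of differing bytes.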
