Checking if two SSE registers are not both zero without destroying them
Problem description
I want to test whether two SSE registers are not both zero, without destroying them.
This is the code I currently have:
uint8_t *src; // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;
xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // Taken unless xmm0 and xmm1 are both zero
}
Is this the best way (using up to SSE 4.2)?
Recommended answer
I learned something useful from this question. Let's first look at some scalar code:
extern void foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with something like gcc -O3 -S -masm=intel test.c
and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code, there is no SIMD FLAGS register. However, SSE4.1 provides SIMD test instructions that can set the zero flag (and the carry flag) in the scalar FLAGS register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c
and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction because the packed bitwise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question: your current solution is the best you can do.
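To make the SSE4.1 approach easy to try in isolation, here is a minimal sketch that wraps it in a helper. The name either_nonzero_sse41 is hypothetical and not part of the original answer, and the target attribute is an assumption that a GCC- or Clang-compatible compiler is used (it lets the file build without passing -msse4.1 globally).

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 */

/* Hypothetical helper, not from the original answer.
   Returns nonzero unless x and y are both all-zero.
   x and y are passed by value and are not modified. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))
#endif
static int either_nonzero_sse41(__m128i x, __m128i y)
{
    __m128i z = _mm_or_si128(x, y);   /* scratch register holds x | y */
    return !_mm_testz_si128(z, z);    /* ptest sets ZF when (z & z) == 0 */
}
```

Calling either_nonzero_sse41(xmm0, xmm1) with the two loaded vectors from the question leaves both of them available for later use.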
However, if you do not have a processor with SSE4.1 or better, you will have to use something else. An alternative that only needs SSE2 is to use _mm_movemask_epi8.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
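Notice that this takes one more instruction than with the SSE4.1 ptest instruction. Be aware, too, that pmovmskb collects only the most significant bit of each byte, so as written this check reports non-zero only when some byte of x | y has its top bit set. A fully general SSE2 test has to compare against zero first (pcmpeqb) and then check that the resulting mask is not 0xFFFF.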
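A sketch of a general-purpose SSE2 check follows. It compares every byte with zero before taking the mask, so bytes such as 0x01 (top bit clear) are detected as well. The helper name either_nonzero_sse2 is hypothetical and not from the original answer, and the extra pcmpeqb means the compiler output will differ from the listing above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper, not from the original answer.
   Returns nonzero unless x and y are both all-zero.
   x and y are passed by value and are not modified. */
static int either_nonzero_sse2(__m128i x, __m128i y)
{
    __m128i z   = _mm_or_si128(x, y);                     /* x | y in a scratch register */
    __m128i eq0 = _mm_cmpeq_epi8(z, _mm_setzero_si128()); /* 0xFF in every byte that is zero */
    /* The mask has one bit per byte; 0xFFFF means all 16 bytes were zero. */
    return _mm_movemask_epi8(eq0) != 0xFFFF;
}
```

This costs an extra compare and a zeroed register compared with the ptest version, which is one more reason to prefer ptest when SSE4.1 is available.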
Until now I have been using the pmovmskb instruction because its latency is better than that of ptest on pre-Sandy Bridge processors. However, I came to that conclusion before Haswell. On Haswell the latency of pmovmskb is worse than the latency of ptest, and both have the same throughput. In this case, though, that is not really what matters. What matters (and what I did not realize before) is that pmovmskb does not set the FLAGS register, so it requires another instruction. So from now on I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP, there is a way this can be done without using another SSE register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register, this uses two scalar registers. Because pmovmskb reads only the top bit of each byte, this variant detects only bytes whose high bit is set.
Note that fewer instructions does not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.