Add saturate 32-bit signed ints intrinsics?
Question
Can someone recommend a fast way to do a saturating add of 32-bit signed integers using Intel intrinsics (AVX, SSE4, ...)?
I looked at the intrinsics guide and found _mm256_adds_epi16, but this seems to only add 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around.
Recommended answer
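A signed overflow will happen if (and only if):
- both inputs have the same sign, and
- the sign of the sum (when added with wrap-around) differs from the sign of the inputs.
Using C operators: overflow = ~(a^b) & (a^(a+b)). The sign bit of overflow is set exactly when the addition overflowed; the lower bits carry no meaning.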
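To make the condition concrete, here is a minimal scalar sketch in plain C. The function name adds_i32_scalar is made up for this illustration. The wrapping sum is computed on unsigned values, because signed overflow is undefined behaviour in C.

```c
#include <stdint.h>

/* Hypothetical scalar reference: saturating add of two signed 32-bit ints. */
static int32_t adds_i32_scalar(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;                      /* wraps modulo 2^32 */

    /* Sign bit set iff a and b share a sign and the sum's sign differs. */
    uint32_t overflow = ~(ua ^ ub) & (ua ^ sum);

    if (overflow >> 31) {
        /* The saturated value has the same sign as the inputs. */
        return a < 0 ? INT32_MIN : INT32_MAX;
    }
    return a + b;                                /* cannot overflow here */
}
```

This version is only meant as a reference to check the vector variants against, not as a fast path.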
Also, if an overflow happens, the saturated result will have the same sign as either input. Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:
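#include <smmintrin.h>  // SSE4.1 (_mm_blendv_ps); compile with SSE4.1 enabled, e.g. -msse4.1

// Renamed from __mm_adds_epi32: identifiers starting with a double underscore are reserved.
__m128i mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b:
    // int_max + 0 for non-negative a, int_max + 1 == int_min for negative a
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    // saturation happened if inputs do not have different signs,
    // but sign of result is different:
    __m128i sign_xor = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    // blendv takes the SECOND operand where the mask's sign bit is set,
    // so saturated must be second and res first
    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                           _mm_castsi128_ps( saturated ),
                                           _mm_castsi128_ps( overflow ) ) );
}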
If your blendvps is as fast as (or faster than) a shift and an addition (also considering port usage), you can of course just blend int_min and int_max, using the sign bits of a.
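A sketch of that two-blend variant could look like the following. The name adds_epi32_blend2 is hypothetical, and the target attribute is a GCC/Clang extension used here only so the function builds without -msse4.1; with that flag (or on MSVC) it can simply be dropped. Whether this beats the shift-and-add version depends on the CPU, so it should be measured rather than assumed.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */
#include <stdint.h>

/* Hypothetical variant: pick the saturated value with a blend on sign(a). */
__attribute__((target("sse4.1")))
static __m128i adds_epi32_blend2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(INT32_MAX);
    const __m128i int_min = _mm_set1_epi32(INT32_MIN);

    /* Wrapping sum. */
    __m128i res = _mm_add_epi32(a, b);

    /* int_min where the sign bit of a is set, int_max elsewhere. */
    __m128i saturated = _mm_castps_si128(_mm_blendv_ps(
        _mm_castsi128_ps(int_max),
        _mm_castsi128_ps(int_min),
        _mm_castsi128_ps(a)));

    /* Sign bit set iff a and b share a sign and res differs from it. */
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, res));

    /* Take saturated where overflow's sign bit is set, res elsewhere. */
    return _mm_castps_si128(_mm_blendv_ps(
        _mm_castsi128_ps(res),
        _mm_castsi128_ps(saturated),
        _mm_castsi128_ps(overflow)));
}
```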
Also, if you have only SSE2 or SSE3, you can replace the last blend by an arithmetic shift (of overflow) 31 bits to the right, followed by manual blending (using and/andnot/or).
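As a rough sketch of the SSE2-only fallback (the function name adds_epi32_sse2 is invented for this example): the arithmetic shift turns the sign bit of overflow into an all-ones or all-zeros lane mask, which then selects between the two candidates.

```c
#include <emmintrin.h>  /* SSE2 only */
#include <stdint.h>

/* Hypothetical SSE2 variant: manual blend instead of blendvps. */
static __m128i adds_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(INT32_MAX);

    /* Wrapping sum. */
    __m128i res = _mm_add_epi32(a, b);

    /* int_max for non-negative a, int_max + 1 == int_min for negative a. */
    __m128i saturated = _mm_add_epi32(int_max, _mm_srli_epi32(a, 31));

    /* Sign bit set iff the addition overflowed. */
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, res));

    /* Broadcast the sign bit to all 32 bits of each lane. */
    __m128i mask = _mm_srai_epi32(overflow, 31);

    /* (mask & saturated) | (~mask & res) */
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}
```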
And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).
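One possible AVX2 rewrite is sketched below, together with a small array wrapper. Both names (adds_epi32_avx2, adds_epi32_avx2_array) and the scalar tail loop are assumptions made for this illustration, not part of the original answer. The target attribute is a GCC/Clang extension that stands in for compiling with -mavx2; the caller is still responsible for making sure the CPU supports AVX2.

```c
#include <immintrin.h>  /* AVX2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-bit version: same steps as the SSE4.1 code, 8 lanes. */
__attribute__((target("avx2")))
static __m256i adds_epi32_avx2(__m256i a, __m256i b)
{
    const __m256i int_max = _mm256_set1_epi32(INT32_MAX);
    __m256i res = _mm256_add_epi32(a, b);                 /* wrapping sum */
    __m256i saturated = _mm256_add_epi32(int_max, _mm256_srli_epi32(a, 31));
    __m256i overflow = _mm256_andnot_si256(_mm256_xor_si256(a, b),
                                           _mm256_xor_si256(a, res));
    /* Take saturated where overflow's sign bit is set, res elsewhere. */
    return _mm256_castps_si256(_mm256_blendv_ps(
        _mm256_castsi256_ps(res),
        _mm256_castsi256_ps(saturated),
        _mm256_castsi256_ps(overflow)));
}

/* Hypothetical wrapper: out[i] = saturate(a[i] + b[i]) for i in [0, n). */
__attribute__((target("avx2")))
static void adds_epi32_avx2_array(const int32_t *a, const int32_t *b,
                                  int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                          /* full vectors */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), adds_epi32_avx2(va, vb));
    }
    for (; i < n; ++i) {                                  /* scalar tail */
        int64_t s = (int64_t)a[i] + b[i];
        out[i] = s > INT32_MAX ? INT32_MAX
               : s < INT32_MIN ? INT32_MIN : (int32_t)s;
    }
}
```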
Addendum: If you know the sign of either a or b at compile-time, you can directly set saturated accordingly, and you can save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for positive (non-negative) a and _mm_andnot_si128(res, b) for negative a (with res = a+b).
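The two specialised cases might be sketched as follows. The function names are hypothetical, and each function silently assumes the stated sign of a in every lane; nothing checks that precondition. As before, the target attribute is a GCC/Clang extension replacing -msse4.1.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */
#include <stdint.h>

/* Hypothetical: valid only if every lane of a is >= 0. */
__attribute__((target("sse4.1")))
static __m128i adds_epi32_nonneg_a(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(INT32_MAX);
    __m128i res = _mm_add_epi32(a, b);
    /* Overflow iff b >= 0 and res < 0: sign bit of (~b & res). */
    __m128i overflow = _mm_andnot_si128(b, res);
    return _mm_castps_si128(_mm_blendv_ps(
        _mm_castsi128_ps(res),
        _mm_castsi128_ps(int_max),
        _mm_castsi128_ps(overflow)));
}

/* Hypothetical: valid only if every lane of a is < 0. */
__attribute__((target("sse4.1")))
static __m128i adds_epi32_neg_a(__m128i a, __m128i b)
{
    const __m128i int_min = _mm_set1_epi32(INT32_MIN);
    __m128i res = _mm_add_epi32(a, b);
    /* Overflow iff b < 0 and res >= 0: sign bit of (~res & b). */
    __m128i overflow = _mm_andnot_si128(res, b);
    return _mm_castps_si128(_mm_blendv_ps(
        _mm_castsi128_ps(res),
        _mm_castsi128_ps(int_min),
        _mm_castsi128_ps(overflow)));
}
```

A typical use would be adding a constant offset, where the sign of one operand is fixed by the source code.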