Add saturate 32-bit signed ints intrinsics?


Question

Can someone recommend a fast way to do a saturating add of 32-bit signed integers using Intel intrinsics (AVX, SSE4, ...)?

I looked at the intrinsics guide and found _mm256_adds_epi16, but this seems to only add 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around.

Recommended answer

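A signed overflow will happen if (and only if):

  • both inputs have the same sign, and

  • the sign of the sum (computed with wrap-around) differs from the sign of the inputs.

Using C operators: overflow = ~(a ^ b) & (a ^ (a + b)), where a + b is the wrapped sum. The sign bit of this expression is set exactly when an overflow occurs.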
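The overflow condition above can be checked one lane at a time in plain C. The following is a minimal scalar sketch, not part of the original answer; the function name sat_add_i32 is hypothetical, and the addition is done on unsigned values because signed overflow is undefined behavior in C.

```c
#include <stdint.h>

/* Hypothetical scalar reference for a single 32-bit lane. */
static int32_t sat_add_i32(int32_t a, int32_t b)
{
    /* Add as unsigned so that the wrap-around is well defined. */
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;

    /* Sign bit is set iff a and b agree in sign and the sum does not. */
    uint32_t overflow = ~(ua ^ ub) & (ua ^ sum);

    if (overflow >> 31) {
        /* The saturated value has the same sign as the inputs. */
        return (a < 0) ? INT32_MIN : INT32_MAX;
    }
    return a + b; /* no overflow here, so the signed add is safe */
}
```

Such a scalar version can also serve as a reference when testing a vectorized implementation lane by lane.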

Also, if an overflow happens, the saturated result will have the same sign as both inputs (which agree in sign whenever an overflow occurs). Using the int_min = int_max + 1 trick (with wrap-around) suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:

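#include <smmintrin.h> // SSE4.1, needed for _mm_blendv_ps

// Note: identifiers starting with a double underscore are reserved for the
// implementation; consider renaming this function in real code.
__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res      = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b:
    // int_max + 0 for non-negative a, int_max + 1 == int_min for negative a
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    // saturation happened if inputs do not have different signs,
    // but sign of result is different:
    __m128i sign_xor  = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    // blendvps picks the second operand where the mask's sign bit is set,
    // so res is kept unless overflow is flagged
    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                          _mm_castsi128_ps(saturated),
                                          _mm_castsi128_ps( overflow ) ) );
}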

If your blendvps is as fast as (or faster than) a shift and an addition (also considering port usage), you can of course just blend int_min and int_max, using the sign bits of a as the mask. Also, if you have only SSE2 or SSE3, you can replace the last blend by an arithmetic shift (of overflow) 31 bits to the right, followed by manual blending (using and/andnot/or).
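The SSE2-only fallback described above could look roughly like the following. This is a sketch under stated assumptions rather than code from the original answer: the function name sat_add_epi32_sse2 is hypothetical, and the arithmetic shift turns the overflow sign bit into an all-ones or all-zeros lane mask for the manual blend.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical SSE2-only variant: blend with and/andnot/or instead of blendvps. */
static __m128i sat_add_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);

    /* wrapped sum */
    __m128i res = _mm_add_epi32(a, b);

    /* int_max for non-negative a, int_min (int_max + 1, wrapped) for negative a */
    __m128i saturated = _mm_add_epi32(int_max, _mm_srli_epi32(a, 31));

    /* sign bit set where a and b agree in sign but res does not */
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, res));

    /* broadcast the sign bit over the whole lane: all ones or all zeros */
    __m128i mask = _mm_srai_epi32(overflow, 31);

    /* (mask & saturated) | (~mask & res) */
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}
```

Whether this is faster or slower than the blendvps version depends on the target microarchitecture, so it is worth measuring on the actual hardware.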

And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).
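As an illustration of that rewrite, here is one possible AVX2 sketch that processes eight lanes at once. It is not taken from the original answer: the helper name sat_add_epi32_avx2, the pointer-based interface, and the TARGET_AVX2 macro are assumptions made for this example, and it needs a CPU with AVX2 support to run. The target attribute is a GCC/Clang feature that allows compiling this one function for AVX2 without a global -mavx2 flag.

```c
#include <immintrin.h> /* AVX2 */
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: out[i] = saturate(a[i] + b[i]) for i = 0..7. */
TARGET_AVX2
static void sat_add_epi32_avx2(const int32_t *a, const int32_t *b, int32_t *out)
{
    const __m256i int_max = _mm256_set1_epi32(0x7FFFFFFF);
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* wrapped sum */
    __m256i res = _mm256_add_epi32(va, vb);

    /* int_max for non-negative a, int_min for negative a */
    __m256i saturated = _mm256_add_epi32(int_max, _mm256_srli_epi32(va, 31));

    /* sign bit set where a and b agree in sign but res does not */
    __m256i overflow = _mm256_andnot_si256(_mm256_xor_si256(va, vb),
                                           _mm256_xor_si256(va, res));

    /* take saturated where the overflow sign bit is set, res otherwise */
    __m256i r = _mm256_castps_si256(
        _mm256_blendv_ps(_mm256_castsi256_ps(res),
                         _mm256_castsi256_ps(saturated),
                         _mm256_castsi256_ps(overflow)));

    _mm256_storeu_si256((__m256i *)out, r);
}
```

The body mirrors the 128-bit version step by step; only the register width and the intrinsic prefixes change.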

Addendum: If you know the sign of either a or b at compile time, you can directly set saturated accordingly (int_max for a non-negative input, int_min for a negative one), and you can save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for non-negative a and _mm_andnot_si128(res, b) for negative a (with res = a + b).
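For the case where a is known to be non-negative, the shortcut might be sketched as follows. The function name sat_add_nonneg_a_epi32 is hypothetical, the code assumes every lane of a is non-negative (it gives wrong results otherwise), and it uses the SSE2 manual blend so that it compiles without extra flags on x86-64.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper. ASSUMPTION: every lane of a is non-negative. */
static __m128i sat_add_nonneg_a_epi32(__m128i a, __m128i b)
{
    /* a >= 0, so an overflow can only saturate towards int_max */
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);

    /* wrapped sum */
    __m128i res = _mm_add_epi32(a, b);

    /* ~b & res: sign bit set where b is non-negative but res is negative */
    __m128i overflow = _mm_andnot_si128(b, res);

    /* all-ones lanes where an overflow happened */
    __m128i mask = _mm_srai_epi32(overflow, 31);

    /* (mask & int_max) | (~mask & res) */
    return _mm_or_si128(_mm_and_si128(mask, int_max),
                        _mm_andnot_si128(mask, res));
}
```

The negative-a case would be symmetric: use int_min as the saturated value and _mm_andnot_si128(res, b) as the overflow test.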
