Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

Question

I am trying to compute the sum of 32 elements (1 byte each) on an Intel i3 processor. I did this:

s=0; 
for (i=0; i<32; i++)
{
    s = s + a[i];
}  

However, this takes too long: my application is real-time and needs the sum computed in much less time. Please note that the final sum can exceed 255.

Is there a way to implement this using low-level SIMD SSE2 instructions? Unfortunately, I have never used SSE. I searched for an SSE2 function that does this, but could not find one. Is SSE guaranteed to reduce the computation time for such a small problem?

Any suggestions?

Note: I have implemented similar algorithms using OpenCL and CUDA, and they worked great, but only when the problem size was large. For small problems the overhead cost dominated. I am not sure how this plays out with SSE.

Recommended answer

You can abuse PSADBW to calculate small horizontal sums quickly.

Something like this: (not tested)

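pxor xmm0, xmm0           ; xmm0 = 0
psadbw xmm0, [a + 0]      ; two partial sums of bytes 0..15 (memory operand must be 16-byte aligned)
pxor xmm1, xmm1           ; xmm1 = 0
psadbw xmm1, [a + 16]     ; two partial sums of bytes 16..31
paddw xmm0, xmm1          ; add the partial sums pairwise
pshufd xmm1, xmm0, 2      ; copy the upper partial sum into the low dword
paddw xmm0, xmm1          ; low word in xmm0 is the total sum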
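
To see why PSADBW against a zero register yields a byte sum that cannot overflow here, the following is a scalar model of the instruction, written as a minimal sketch. PSADBW computes, for each 8-byte half of the register, the sum of absolute differences of the unsigned bytes; with one operand zero, that is just the sum of the bytes. The helper names `sad8` and `sum_32_model` are hypothetical and exist only for illustration; they are not part of any library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Scalar model of one 64-bit lane of PSADBW: the sum of absolute
// differences of 8 unsigned byte pairs. (Hypothetical helper.)
static uint16_t sad8(const uint8_t* x, const uint8_t* y)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += static_cast<unsigned>(std::abs(int(x[i]) - int(y[i])));
    // 8 * 255 = 2040, so a lane result always fits in 16 bits.
    assert(sum <= 2040);
    return static_cast<uint16_t>(sum);
}

// Model of the whole 32-byte reduction: four lanes, each compared with zero.
uint16_t sum_32_model(const uint8_t a[32])
{
    static const uint8_t zero[8] = {0};
    unsigned total = 0;
    for (int lane = 0; lane < 4; ++lane)
        total += sad8(zero, a + 8 * lane);   // |0 - a[i]| == a[i]
    return static_cast<uint16_t>(total);     // at most 32 * 255 = 8160
}
```

Since the worst case for 32 bytes is 32 * 255 = 8160, the 16-bit word additions (PADDW) used to combine the partial sums are safe for this input size.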


Attempted intrinsics version:

I never use intrinsics so this code probably makes no sense whatsoever. The disassembly looked OK though.

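#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

uint16_t sum_32(const uint8_t a[32])
{
    // Zero vector; PSADBW against zero sums each group of 8 bytes.
    const __m128i zero = _mm_setzero_si128();
    // Unaligned loads, since a is not guaranteed to be 16-byte aligned.
    __m128i sum0 = _mm_sad_epu8(
                        zero,
                        _mm_loadu_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
                        zero,
                        _mm_loadu_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    // Bring the upper 64-bit partial sum down to the low element and add.
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    // Portable extraction of the low 16 bits (m128i_u16 is MSVC-only).
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));
}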
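
If the input is not exactly 32 bytes, the same PSADBW trick extends naturally. The following is a sketch only, not part of the original answer: the function name `sum_bytes` is hypothetical, and it assumes an x86 target with SSE2 enabled. It accumulates the two 64-bit partial sums with `_mm_add_epi64`, so overflow is not a practical concern, and handles any leftover bytes with a scalar loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical generalization: sum n unsigned bytes, any n, any alignment.
uint64_t sum_bytes(const uint8_t* a, std::size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  // two 64-bit partial sums
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    // Fold the upper 64-bit lane onto the lower one.
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));
    // Store the low 64 bits; this form also works on 32-bit x86.
    uint64_t total;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&total), acc);
    // Scalar tail for the remaining n % 16 bytes.
    for (; i < n; ++i)
        total += a[i];
    return total;
}
```

For the 32-element case in the question, `sum_bytes(a, 32)` should give the same value as `sum_32(a)`. Whether either version beats the plain loop for such a small input depends on the compiler and surrounding code, so it is worth measuring in the real application.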
