Sum reduction of unsigned bytes without overflow, using SSE2 on Intel


Question

I am trying to compute the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this:

s=0; 
for (i=0; i<32; i++)
{
    s = s + a[i];
}  

However, this takes too long: my application is real-time and needs the sum to be computed much faster. Note that the final sum can exceed 255.

Is there a way to implement this using low-level SIMD SSE2 instructions? Unfortunately, I have never used SSE. I searched for an SSE2 function for this purpose but could not find one. Is SSE guaranteed to reduce the computation time for such a small problem?

Any suggestions?

Note: I have implemented similar algorithms using OpenCL and CUDA. That worked well, but only when the problem size was large; for small problems the overhead dominated. I am not sure how this plays out with SSE.

Recommended answer

You can abuse PSADBW to calculate small horizontal sums quickly.
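Why this works: PSADBW computes, for each 8-byte half of its operands, the sum of absolute differences of corresponding bytes and stores it as a 16-bit value in the low word of that 64-bit half. With one operand all zero, |x - 0| = x, so each half simply holds the sum of 8 bytes (at most 8 × 255 = 2040, so nothing overflows). The scalar model below is only an illustration of that behavior; the names `Sad128`, `psadbw_model`, and `horizontal_sum_model` are made up for this sketch and are not part of any library.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar model of PSADBW on one 128-bit register.
// The result holds two partial sums, one per 64-bit half.
struct Sad128 {
    uint16_t lo;  // sum of |x[i] - y[i]| over bytes 0..7
    uint16_t hi;  // sum of |x[i] - y[i]| over bytes 8..15
};

inline Sad128 psadbw_model(const std::array<uint8_t, 16>& x,
                           const std::array<uint8_t, 16>& y)
{
    Sad128 r{0, 0};
    for (int i = 0; i < 8; ++i) {
        r.lo = static_cast<uint16_t>(r.lo + std::abs(int(x[i]) - int(y[i])));
        r.hi = static_cast<uint16_t>(r.hi + std::abs(int(x[i + 8]) - int(y[i + 8])));
    }
    return r;
}

// With y all zero, the absolute differences are just the bytes of x,
// so lo + hi is the horizontal sum of all 16 bytes.
inline uint16_t horizontal_sum_model(const std::array<uint8_t, 16>& x)
{
    const std::array<uint8_t, 16> zero{};
    const Sad128 r = psadbw_model(x, zero);
    return static_cast<uint16_t>(r.lo + r.hi);
}
```

For 32 bytes, two such operations give four partial sums, and adding them yields at most 32 × 255 = 8160, which still fits comfortably in 16 bits.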

Something like this (not tested):

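pxor   xmm0, xmm0        ; xmm0 = 0
psadbw xmm0, [a + 0]     ; two partial sums of bytes 0..15 (memory operand must be 16-byte aligned)
pxor   xmm1, xmm1        ; xmm1 = 0
psadbw xmm1, [a + 16]    ; two partial sums of bytes 16..31
paddw  xmm0, xmm1        ; add the partial sums per 64-bit half
pshufd xmm1, xmm0, 2     ; copy the upper half's sum (dword 2) into dword 0 of xmm1
paddw  xmm0, xmm1        ; low word in xmm0 is the total sum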

Attempted intrinsics version:

I never use intrinsics, so this code may not make much sense. The disassembly looked OK, though.

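#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

uint16_t sum_32(const uint8_t a[32])
{
    // _mm_setzero_si128 avoids reading an uninitialized variable.
    __m128i zero = _mm_setzero_si128();
    // SAD against zero sums each group of 8 bytes into its 64-bit half.
    // Unaligned loads are used so that a need not be 16-byte aligned.
    __m128i sum0 = _mm_sad_epu8(
                        zero,
                        _mm_loadu_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
                        zero,
                        _mm_loadu_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    // Bring the upper half's partial sum down and add it to the lower one.
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    // Portable extraction of the low word (m128i_u16 is MSVC-specific).
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));
}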
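
Since the code above is described as untested, one way to gain some confidence is to compare an SSE2 version against the scalar loop from the question. The sketch below restates the routine so that it stands alone. The names `sum_32_scalar`, `sum_32_sse2`, and `sum_32_versions_agree` are hypothetical, and the use of `_mm_setzero_si128`, unaligned loads, and `_mm_cvtsi128_si32` is an assumption made here for portability rather than something the answer prescribes.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Scalar reference: the loop from the question with a 16-bit accumulator.
inline uint16_t sum_32_scalar(const uint8_t a[32])
{
    uint16_t s = 0;
    for (int i = 0; i < 32; ++i)
        s = static_cast<uint16_t>(s + a[i]);
    return s;
}

// SSE2 version, restated so this sketch is self-contained.
inline uint16_t sum_32_sse2(const uint8_t a[32])
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 16));
    // Each SAD leaves two partial sums, one per 64-bit half.
    const __m128i sums = _mm_add_epi16(_mm_sad_epu8(zero, lo),
                                       _mm_sad_epu8(zero, hi));
    // Fold the upper half onto the lower half; the total lands in the low word.
    const __m128i total = _mm_add_epi16(sums, _mm_shuffle_epi32(sums, 2));
    return static_cast<uint16_t>(_mm_cvtsi128_si32(total));
}

// Returns true if both versions agree on a handful of inputs, including
// the worst case of 32 bytes of 255 (sum 8160, which fits in 16 bits).
inline bool sum_32_versions_agree()
{
    uint8_t a[32];
    for (int base = 0; base < 256; base += 51) {  // 0, 51, ..., 255
        for (int i = 0; i < 32; ++i)
            a[i] = static_cast<uint8_t>((base + 7 * i) & 0xFF);
        if (sum_32_scalar(a) != sum_32_sse2(a))
            return false;
    }
    for (int i = 0; i < 32; ++i)
        a[i] = 255;
    return sum_32_scalar(a) == sum_32_sse2(a);
}
```

Agreement on a few inputs is not a proof of correctness, and it says nothing about speed; whether the SSE2 version is actually faster for only 32 bytes would have to be measured on the target machine.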
