Fastest 50% scaling of (A)RGB32 images using SSE intrinsics

Question

I want to scale down images as fast as I can in C++. This article describes how to efficiently average 32-bit RGB images down by 50%. It is fast and looks good.

I have tried modifying that approach using SSE intrinsics. The code below works with or without SSE enabled. Surprisingly, though, the speedup is negligible.

Can anybody see a way of improving the SSE code? The two lines creating the variables shuffle1 and shuffle2 seem to be candidates (using some clever shifting or similar).

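#include <cstdint>      // uint32_t
#include <emmintrin.h>  // SSE2 intrinsics (_mm_avg_epu8, _mm_load_si128, ...)

/*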
 * Calculates the average of two rgb32 pixels.
 */
inline static uint32_t avg(uint32_t a, uint32_t b)
{
    return (((a^b) & 0xfefefefeUL) >> 1) + (a&b);
}

/*
 * Calculates the average of four rgb32 pixels.
 */
inline static uint32_t avg(const uint32_t a[2], const uint32_t b[2])
{
    return avg(avg(a[0], a[1]), avg(b[0], b[1]));
}

/*
 * Calculates the average of two rows of rgb32 pixels.
 */
void average2Rows(const uint32_t* src_row1, const uint32_t* src_row2, uint32_t* dst_row, int w)
{
#if !defined(__SSE)
        for (int x = w; x; --x, dst_row++, src_row1 += 2, src_row2 += 2)
            * dst_row = avg(src_row1, src_row2);
#else
        for (int x = w; x; x-=4, dst_row+=4, src_row1 += 8, src_row2 += 8)
        {
            __m128i left  = _mm_avg_epu8(_mm_load_si128((__m128i const*)src_row1), _mm_load_si128((__m128i const*)src_row2));
            __m128i right = _mm_avg_epu8(_mm_load_si128((__m128i const*)(src_row1+4)), _mm_load_si128((__m128i const*)(src_row2+4)));

            __m128i shuffle1 = _mm_set_epi32( right.m128i_u32[2], right.m128i_u32[0], left.m128i_u32[2], left.m128i_u32[0]);
            __m128i shuffle2 = _mm_set_epi32( right.m128i_u32[3], right.m128i_u32[1], left.m128i_u32[3], left.m128i_u32[1]);

            _mm_store_si128((__m128i *)dst_row, _mm_avg_epu8(shuffle1, shuffle2));
        }
#endif
}
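As a side note on the scalar avg() above, the following sketch checks the bit trick against a straightforward per-channel average. It relies on the identity a + b = (a ^ b) + 2 * (a & b), and the 0xfefefefe mask keeps the low bit of each channel from shifting into its neighbour. The names avgFloor, avgFloorReference and checkAvgFloor are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Same expression as avg() in the question: per-byte floor((a + b) / 2).
inline uint32_t avgFloor(uint32_t a, uint32_t b)
{
    return (((a ^ b) & 0xfefefefeu) >> 1) + (a & b);
}

// Hypothetical reference: average each 8-bit channel separately.
inline uint32_t avgFloorReference(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        uint32_t ca = (a >> shift) & 0xffu;
        uint32_t cb = (b >> shift) & 0xffu;
        result |= ((ca + cb) >> 1) << shift;
    }
    return result;
}

// Compares the two implementations on a few hand-picked pixel pairs.
inline void checkAvgFloor()
{
    const uint32_t samples[] = { 0x00000000u, 0xffffffffu, 0x01020304u,
                                 0xff00ff01u, 0x00ff0102u, 0x807f0180u };
    for (uint32_t a : samples)
        for (uint32_t b : samples)
            assert(avgFloor(a, b) == avgFloorReference(a, b));
}
```

One thing to keep in mind when comparing the two code paths: the scalar expression rounds each channel down, whereas _mm_avg_epu8 computes (a + b + 1) >> 1 and so rounds up. The SSE and non-SSE branches can therefore differ by one in some channels.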

Recommended answer

Transferring data between general-purpose registers and SSE registers is really slow, so you should avoid things like:

__m128i shuffle1 = _mm_set_epi32( right.m128i_u32[2], right.m128i_u32[0], left.m128i_u32[2], left.m128i_u32[0]);
__m128i shuffle2 = _mm_set_epi32( right.m128i_u32[3], right.m128i_u32[1], left.m128i_u32[3], left.m128i_u32[1]);

Instead, shuffle the values within the SSE registers using the appropriate shuffle operations.

This should be what you are looking for:

__m128i t0 = _mm_unpacklo_epi32( left, right ); // right.m128i_u32[1] left.m128i_u32[1] right.m128i_u32[0] left.m128i_u32[0]
__m128i t1 = _mm_unpackhi_epi32( left, right ); // right.m128i_u32[3] left.m128i_u32[3] right.m128i_u32[2] left.m128i_u32[2]
__m128i shuffle1 = _mm_unpacklo_epi32( t0, t1 );    // right.m128i_u32[2] right.m128i_u32[0] left.m128i_u32[2] left.m128i_u32[0]
__m128i shuffle2 = _mm_unpackhi_epi32( t0, t1 );    // right.m128i_u32[3] right.m128i_u32[1] left.m128i_u32[3] left.m128i_u32[1]
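To put the unpack sequence in context, here is a minimal sketch of how it might slot into the whole row loop. The function name average2RowsSse2 and the variable names even and odd are hypothetical. It uses unaligned loads and stores, which is an assumption made so the sketch does not depend on 16-byte-aligned buffers (the question's code uses the aligned variants). It also assumes the destination width w is a multiple of 4.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: averages two source rows into one destination row.
// w is the destination width in pixels; assumed to be a multiple of 4.
void average2RowsSse2(const uint32_t* src_row1, const uint32_t* src_row2,
                      uint32_t* dst_row, int w)
{
    assert(w % 4 == 0);
    for (int x = 0; x < w; x += 4, dst_row += 4, src_row1 += 8, src_row2 += 8)
    {
        // Vertical averages of 8 source pixels, 4 per register.
        __m128i left = _mm_avg_epu8(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src_row1)),
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src_row2)));
        __m128i right = _mm_avg_epu8(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src_row1 + 4)),
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src_row2 + 4)));

        // Deinterleave entirely inside SSE registers:
        // even = pixels 0, 2, 4, 6 and odd = pixels 1, 3, 5, 7.
        __m128i t0   = _mm_unpacklo_epi32(left, right);
        __m128i t1   = _mm_unpackhi_epi32(left, right);
        __m128i even = _mm_unpacklo_epi32(t0, t1);
        __m128i odd  = _mm_unpackhi_epi32(t0, t1);

        // Horizontal average of neighbouring pixels yields 4 output pixels.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst_row),
                         _mm_avg_epu8(even, odd));
    }
}
```

If the buffers are known to be 16-byte aligned, the aligned load and store intrinsics from the question can be swapped back in. Whether that makes a measurable difference depends on the target CPU, so it is worth profiling.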
