Bilinear filter with SSE4.1 intrinsics


Question

As an exercise in getting used to intrinsics, I am trying to write a reasonably fast bilinear filtering function that produces one filtered sample at a time. Anything up to SSE4.1 is fine.

So far I have the following:

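#include <stdint.h>
#include <smmintrin.h>  //  SSE4.1 intrinsics (also pulls in the SSE2/SSSE3 headers)

inline __m128i DivideBy255_8xUint16(const __m128i value)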
{
    //  Blinn 16bit divide by 255 trick but across 8 packed 16bit values
    const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128));
    const __m128i plus128ThenDivideBy256 = _mm_srli_epi16(plus128, 8);          //  TODO:   Should this be an arithmetic or logical shift or does it matter?
    const __m128i partial = _mm_add_epi16(plus128, plus128ThenDivideBy256);
    const __m128i result = _mm_srli_epi16(partial, 8);                          //  TODO:   Should this be an arithmetic or logical shift or does it matter?


    return result;
}


inline uint32_t BilinearSSE41(const uint8_t* data, uint32_t pitch, uint32_t width, uint32_t height, float u, float v)
{
    //  TODO:   There are probably intrinsics I haven't found yet to avoid using these?
    //  0x80 is high bit set which means zero out that component
    const __m128i unpack_fraction_u_mask = _mm_set_epi8(0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0);
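    //  fraction_v*255 sits in the second 32 bit lane, so its low byte is byte 4 (byte 1 is the always-zero second byte of fraction_u*255)
    const __m128i unpack_fraction_v_mask = _mm_set_epi8(0x80, 4, 0x80, 4, 0x80, 4, 0x80, 4, 0x80, 4, 0x80, 4, 0x80, 4, 0x80, 4);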
    const __m128i unpack_two_texels_mask = _mm_set_epi8(0x80, 7, 0x80, 6, 0x80, 5, 0x80, 4, 0x80, 3, 0x80, 2, 0x80, 1, 0x80, 0);


    //  TODO:   Potentially wasting two channels of operations for now
    const __m128i size = _mm_set_epi32(0, 0, height - 1, width - 1);
    const __m128 uv = _mm_set_ps(0.0f, 0.0f, v, u);

    const __m128 floor_uv_f = _mm_floor_ps(uv);
    const __m128 fraction_uv_f = _mm_sub_ps(uv, floor_uv_f);
    const __m128 fraction255_uv_f = _mm_mul_ps(fraction_uv_f, _mm_set_ps1(255.0f));
    const __m128i fraction255_uv_i = _mm_cvttps_epi32(fraction255_uv_f);    //  Truncates toward zero; _mm_cvtps_epi32 would round using the current MXCSR mode (nearest by default)

    const __m128i fraction255_u_i = _mm_shuffle_epi8(fraction255_uv_i, unpack_fraction_u_mask); //  Splat fraction_u*255 across all 16 bit words
    const __m128i fraction255_v_i = _mm_shuffle_epi8(fraction255_uv_i, unpack_fraction_v_mask); //  Splat fraction_v*255 across all 16 bit words

    const __m128i inverse_fraction255_u_i = _mm_sub_epi16(_mm_set1_epi16(255), fraction255_u_i);
    const __m128i inverse_fraction255_v_i = _mm_sub_epi16(_mm_set1_epi16(255), fraction255_v_i);

    const __m128i floor_uv_i = _mm_cvttps_epi32(floor_uv_f);
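    //  Clamp to [0, size]. The signed max against zero handles negative coordinates, which an unsigned min alone would send to the far edge
    const __m128i clipped_floor_uv_i = _mm_min_epi32(_mm_max_epi32(floor_uv_i, _mm_setzero_si128()), size);


    //  TODO:   Calculating the addresses in the SSE register set would maybe be better

    //  Use the clamped coordinates, not the raw floor values, to address the image
    int u0 = _mm_extract_epi32(clipped_floor_uv_i, 0);
    int v0 = _mm_extract_epi32(clipped_floor_uv_i, 1);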


    const uint8_t* row = data + (u0<<2) + pitch*v0;


    //  Load two adjacent texels from each of the two rows, starting at the computed address (not at the start of the image)
    const __m128i row0_packed = _mm_loadl_epi64((const __m128i*)row);
    const __m128i row0 = _mm_shuffle_epi8(row0_packed, unpack_two_texels_mask);

    const __m128i row1_packed = _mm_loadl_epi64((const __m128i*)(row + pitch));
    const __m128i row1 = _mm_shuffle_epi8(row1_packed, unpack_two_texels_mask);


    //  Compute row0*(255 - fraction)/255 + row1*fraction/255 - probably slight precision loss across addition!
    const __m128i vlerp0 = DivideBy255_8xUint16(_mm_mullo_epi16(row0, inverse_fraction255_v_i));
    const __m128i vlerp1 = DivideBy255_8xUint16(_mm_mullo_epi16(row1, fraction255_v_i));
    const __m128i vlerp = _mm_adds_epi16(vlerp0, vlerp1);

    //  Low four words hold the left texel, high four the right texel; the right one is shifted down before blending
    const __m128i hlerp0 = DivideBy255_8xUint16(_mm_mullo_epi16(vlerp, inverse_fraction255_u_i));
    const __m128i hlerp1 = DivideBy255_8xUint16(_mm_srli_si128(_mm_mullo_epi16(vlerp, fraction255_u_i), 16 - 2*4));
    const __m128i hlerp = _mm_adds_epi16(hlerp0, hlerp1);


    //  Pack down to 8bit from 16bit components and return 32bit ARGB result
    return _mm_extract_epi32(_mm_packus_epi16(hlerp, hlerp), 0);
}
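
When tuning a SIMD routine like this, it can help to have a plain scalar version to compare results against. The following is only a sketch under stated assumptions: the names DivideBy255 and BilinearReference are hypothetical, u and v are assumed non-negative and inside the padded image, and the weighting is the conventional one where the texel at (u0, v0) gets (255 - fraction).

```c
#include <stddef.h>
#include <stdint.h>

/* Blinn's rounding divide by 255 for one value, valid for 0 <= x <= 255*255. */
uint32_t DivideBy255(uint32_t x)
{
    const uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

/* Hypothetical scalar reference. Assumes ARGB8 texels, u >= 0, v >= 0, and that
   texel (u0 + 1, v0 + 1) is readable, i.e. the image is padded. */
uint32_t BilinearReference(const uint8_t* data, uint32_t pitch, float u, float v)
{
    /* For non-negative input, truncation is the same as floor. */
    const uint32_t u0 = (uint32_t)u;
    const uint32_t v0 = (uint32_t)v;

    /* Fractions scaled to 0..255 and truncated, as _mm_cvttps_epi32 would do. */
    const uint32_t wu = (uint32_t)((u - (float)u0) * 255.0f);
    const uint32_t wv = (uint32_t)((v - (float)v0) * 255.0f);

    const uint8_t* row0 = data + (size_t)v0 * pitch + (size_t)u0 * 4;
    const uint8_t* row1 = row0 + pitch;

    uint32_t result = 0;
    for (int c = 0; c < 4; ++c) {
        /* Vertical blend of the left and right texel columns. */
        const uint32_t left  = DivideBy255(row0[c] * (255 - wv)) + DivideBy255(row1[c] * wv);
        const uint32_t right = DivideBy255(row0[4 + c] * (255 - wv)) + DivideBy255(row1[4 + c] * wv);

        /* Horizontal blend of the two intermediate values. */
        const uint32_t value = DivideBy255(left * (255 - wu)) + DivideBy255(right * wu);
        result |= value << (8 * c);
    }
    return result;
}
```

Running both versions over a grid of (u, v) values and comparing the packed results is one way to catch swapped weights or a wrong shuffle mask before worrying about speed.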

The code assumes the image data is ARGB8 and is padded with one extra column and one extra row, so edge cases are handled without branching.
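
One way to satisfy that padding assumption is sketched below. The helper name PadImageARGB8 is hypothetical, the source image is assumed to be tightly packed (pitch equal to width * 4), and replicating the last column and row is just one reasonable choice of padding content.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies a tightly packed ARGB8 image into a buffer that is
   one texel wider and one row taller, replicating the last column and last row.
   Returns NULL for an empty image or on allocation failure; the caller frees it. */
uint8_t* PadImageARGB8(const uint8_t* src, uint32_t width, uint32_t height, uint32_t* out_pitch)
{
    if (width == 0 || height == 0) return NULL;

    const uint32_t pitch = (width + 1) * 4;
    uint8_t* dst = (uint8_t*)malloc((size_t)pitch * (height + 1));
    if (dst == NULL) return NULL;

    for (uint32_t y = 0; y <= height; ++y) {
        /* The extra row at the bottom repeats the last source row. */
        const uint32_t sy = (y < height) ? y : height - 1;
        const uint8_t* s = src + (size_t)sy * width * 4;
        uint8_t* d = dst + (size_t)y * pitch;

        memcpy(d, s, (size_t)width * 4);
        /* The extra texel on the right repeats the last texel of the row. */
        memcpy(d + (size_t)width * 4, s + (size_t)(width - 1) * 4, 4);
    }

    *out_pitch = pitch;
    return dst;
}
```

With this layout, an 8-byte load starting at the last real texel of a row still stays inside the row, and reading row v0 + 1 is valid for every clamped v0.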

I am after advice on which instructions would bring down the size of this gangly mess, and of course on how to make it run faster!

Thanks :)

Recommended answer

I noticed your comment "TODO: Should this be an arithmetic or logical shift or does it matter?"

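Arithmetic shift is for signed integers: the vacated high bits are filled with copies of the sign bit. Logical shift is for unsigned integers: the vacated high bits are filled with zeros.

    0x80000000 >> 4 is 0xf8000000 // Arithmetic shift
    0x80000000 >> 4 is 0x08000000 // Logical shift

The 16 bit values in DivideBy255_8xUint16 are unsigned and can exceed 0x7FFF, so it does matter there, and the logical shift (_mm_srli_epi16) is the one to use.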
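
To make the difference concrete for a single 16 bit lane, here is a small scalar sketch. The function names are hypothetical; they mimic what _mm_srli_epi16 (logical) and _mm_srai_epi16 (arithmetic) do per lane for shift counts 0 to 15, and the arithmetic version is written without relying on the implementation-defined behaviour of >> on negative signed values in C.

```c
#include <stdint.h>

/* Logical right shift of one 16 bit lane: vacated bits become zero. */
uint16_t LogicalShiftRight16(uint16_t x, unsigned n)
{
    return (uint16_t)(x >> n);
}

/* Arithmetic right shift of one 16 bit lane: vacated bits copy the sign bit.
   Assumes 0 <= n <= 15. */
uint16_t ArithmeticShiftRight16(uint16_t x, unsigned n)
{
    const uint16_t fill = (x & 0x8000u) ? (uint16_t)~(0xFFFFu >> n) : 0;
    return (uint16_t)((x >> n) | fill);
}
```

For 0xFE01 (that is 255 * 255, the largest product the filter can produce) shifted by 8, the logical shift gives 0x00FE while the arithmetic shift gives 0xFFFE. For values below 0x8000 the two agree.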
