Using ARM NEON intrinsics to add alpha and permute

Question

I'm developing an iOS app that needs to convert images from RGB to BGRA fairly quickly, and I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?

#include <arm_neon.h>

void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; //process 8 pixels at a time; any remainder (numPix % 8) is not converted

    uint8x8_t alpha = vdup_n_u8 (0xff);

    for (int i=0; i<numPix; i++)
    {
        uint8x8x3_t rgb  = vld3_u8 (src); //de-interleave: val[0]=R, val[1]=G, val[2]=B
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; //these lines are slow
        bgra.val[1] = rgb.val[1]; //these lines are slow
        bgra.val[2] = rgb.val[0]; //these lines are slow

        bgra.val[3] = alpha;

        vst4_u8(dst, bgra); //interleave and store as B, G, R, A

        src += 8*3;
        dst += 8*4;
    }
}
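
Because the loop above divides numPix by 8, up to 7 trailing pixels are left unconverted whenever the pixel count is not a multiple of 8. A minimal scalar sketch, which could serve both as a correctness reference for a NEON path and as a way to finish that tail, might look like the following. The names scalarPermuteRGBtoBGRA and permuteTail are hypothetical and do not come from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical name): converts numPix packed RGB pixels
   to BGRA with a fully opaque alpha channel. */
static void scalarPermuteRGBtoBGRA(const uint8_t *src, uint8_t *dst, size_t numPix)
{
    for (size_t i = 0; i < numPix; i++) {
        dst[4 * i + 0] = src[3 * i + 2]; /* B */
        dst[4 * i + 1] = src[3 * i + 1]; /* G */
        dst[4 * i + 2] = src[3 * i + 0]; /* R */
        dst[4 * i + 3] = 0xff;           /* A */
    }
}

/* Hypothetical helper: converts only the pixels that an 8-pixels-per-iteration
   loop skips. src and dst point at the start of the image buffers. */
static void permuteTail(const uint8_t *src, uint8_t *dst, size_t numPix)
{
    size_t done = numPix & ~(size_t)7; /* pixels already covered by the 8-wide loop */
    scalarPermuteRGBtoBGRA(src + 3 * done, dst + 4 * done, numPix - done);
}
```

Calling permuteTail with the original buffer pointers and the original pixel count after the vector loop fills in the last numPix % 8 pixels.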

Recommended answer

The ARMCC disassembly isn't that fast either:

  • It isn't using the most appropriate instructions

  • It mixes VFP instructions with NEON ones, which causes huge hiccups every time

Try this:

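  mov r2, r2, lsr #3        @ r2 = numPix / 8 (8 pixels per iteration)
  vmov.i8 d3, #0xff         @ d3 = alpha, 0xff in every lane
loop:
  vld3.8 {d0-d2}, [r0]!     @ load 8 RGB pixels: d0 = R, d1 = G, d2 = B
  subs r2, r2, #1           @ decrement the loop counter and set flags
  vswp d0, d2               @ swap R and B in place
  vst4.8 {d0-d3}, [r1]!     @ store 8 interleaved BGRA pixels
  bgt loop

  bx lr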

My suggested code isn't fully optimized either, but further "real" optimizations would seriously harm readability, so I'll stop here.
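
The answer does not say which further optimizations it has in mind. Purely as an illustration, and not as the answer author's method, one direction that could be tried from C is to use the 128-bit variants of the same intrinsics so that each iteration handles 16 pixels, with a scalar loop for the leftovers. The function name permuteRGBtoBGRA16 is hypothetical, the NEON path is guarded so the file still builds where NEON is unavailable, and whether this is actually faster depends on the compiler and the core, so it would need measuring.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical name: converts numPix packed RGB pixels to BGRA, alpha = 0xff. */
void permuteRGBtoBGRA16(const uint8_t *src, uint8_t *dst, size_t numPix)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x16_t alpha = vdupq_n_u8(0xff);

    /* 16 pixels per iteration: 48 bytes in, 64 bytes out. */
    for (; i + 16 <= numPix; i += 16) {
        uint8x16x3_t rgb = vld3q_u8(src + 3 * i); /* de-interleave R, G, B */
        uint8x16x4_t bgra;

        bgra.val[0] = rgb.val[2]; /* B */
        bgra.val[1] = rgb.val[1]; /* G */
        bgra.val[2] = rgb.val[0]; /* R */
        bgra.val[3] = alpha;      /* A */

        vst4q_u8(dst + 4 * i, bgra);              /* interleave B, G, R, A */
    }
#endif

    /* Scalar tail; this is also the whole conversion when NEON is unavailable. */
    for (; i < numPix; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Unlike the loops above, this sketch converts every pixel even when numPix is not a multiple of the vector width.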
