Using ARM NEON intrinsics to add alpha and permute


Problem description

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?

#include <arm_neon.h>

void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; // process 8 pixels at a time (numPix is assumed to be a multiple of 8)

    uint8x8_t alpha = vdup_n_u8(0xff); // opaque alpha in all 8 lanes

    for (int i = 0; i < numPix; i++)
    {
        uint8x8x3_t rgb = vld3_u8(src); // de-interleave 8 RGB pixels into R, G, B vectors
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; // these lines are slow
        bgra.val[1] = rgb.val[1]; // these lines are slow
        bgra.val[2] = rgb.val[0]; // these lines are slow

        bgra.val[3] = alpha;

        vst4_u8(dst, bgra); // interleave and store 8 BGRA pixels

        src += 8*3;
        dst += 8*4;
    }
}

Solution

The ARMCC disassembly isn't that fast either:

  • It isn't using the most appropriate instructions

  • It mixes VFP instructions with NEON ones, which causes a large stall every time execution switches between the two

Try this:

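  mov r2, r2, lsr #3        @ r2 = numPix / 8 (r0 = src, r1 = dst)
  vmov.i8 d3, #0xff         @ d3 = alpha, 0xff in every lane
loop:
  vld3.8 {d0-d2}, [r0]!     @ load 8 RGB pixels, de-interleaved: d0 = R, d1 = G, d2 = B
  subs r2, r2, #1           @ decrement the block counter and set flags
  vswp d0, d2               @ swap R and B, so d0 = B, d2 = R
  vst4.8 {d0-d3}, [r1]!     @ store 8 BGRA pixels, interleaved
  bgt loop

  bx lr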

My suggested code isn't fully optimized either, but further "real" optimizations would seriously harm readability, so I'll stop here.
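For anyone who wants to stay with intrinsics, here is a minimal sketch of the same permutation in C that also handles pixel counts that are not a multiple of 8. The function name rgb_to_bgra and the HAVE_NEON macro are made up for this example and are not from the question or the answer. The NEON path is only compiled when the compiler defines __ARM_NEON; otherwise the scalar loop does all the work. Whether the compiler turns the three val[] assignments into a single swap depends on the toolchain, so treat this as a starting point to benchmark rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, used only in this sketch */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: converts num_pixels packed RGB pixels to BGRA
 * with alpha fixed at 0xff. src holds 3 bytes per pixel, dst 4. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;

#if HAVE_NEON
    const uint8x8_t alpha = vdup_n_u8(0xff);

    /* Vector path: 8 pixels per iteration while a full block remains. */
    for (; i + 8 <= num_pixels; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* de-interleaving load */
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];  /* B */
        bgra.val[1] = rgb.val[1];  /* G */
        bgra.val[2] = rgb.val[0];  /* R */
        bgra.val[3] = alpha;       /* A */
        vst4_u8(dst + 4 * i, bgra);              /* interleaving store */
    }
#endif

    /* Scalar path: leftover pixels, or every pixel when NEON is unavailable. */
    for (; i < num_pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Call it as rgb_to_bgra(src, dst, width * height) for a tightly packed image; rows with padding would need one call per row.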
