Using ARM NEON intrinsics to add alpha and permute

Views: 316 · Published: 2016/5/29 14:46:30 · Tags: arm, neon, intrinsics, cortex-a8

Question

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?

#include <arm_neon.h>

void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; //process 8 pixels at a time

    uint8x8_t alpha = vdup_n_u8 (0xff);

    for (int i=0; i<numPix; i++)
    {
        uint8x8x3_t rgb  = vld3_u8 (src);
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; //these lines are slow
        bgra.val[1] = rgb.val[1]; //these lines are slow
        bgra.val[2] = rgb.val[0]; //these lines are slow

        bgra.val[3] = alpha;

        vst4_u8(dst, bgra);

        src += 8*3;
        dst += 8*4;
    }
}

Answer

The ARMCC disassembly isn't that fast either:

It isn't using the most appropriate instructions.
It mixes VFP instructions with NEON ones, which causes huge hiccups every time.

Try this:

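  mov r2, r2, lsr #3          @ r2 = numPix / 8 (8 pixels per iteration)
  vmov.i8 d3, #0xff           @ d3 = alpha, 0xFF in every lane
loop:
  vld3.8 {d0-d2}, [r0]!       @ load 8 RGB pixels de-interleaved: d0=R, d1=G, d2=B
  subs r2, r2, #1             @ decrement the loop counter and set flags
  vswp d0, d2                 @ swap R and B: d0=B, d1=G, d2=R
  vst4.8 {d0-d3}, [r1]!       @ store 8 BGRA pixels interleaved
  bgt loop

  bx lr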

My suggested code isn't fully optimized either, but further "real" optimizations would seriously harm readability, so I stop here.
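
For readers who want to stay with intrinsics, as the question asks, here is a minimal sketch of the same 8-pixels-per-iteration loop in C. The function name rgb_to_bgra, the size_t pixel count, and the scalar loop that handles leftover pixels when the count is not a multiple of 8 are assumptions added for illustration and are not part of the original question or answer. The NEON path is guarded by a preprocessor check so the same file also builds on targets without NEON. Whether a compiler turns this into code as tight as the hand-written assembly depends on the compiler and its version, so it is worth checking the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name is an assumption): converts num_pix packed RGB
 * pixels in src to BGRA pixels in dst, with alpha set to 0xFF. */
static void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t num_pix)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t alpha = vdup_n_u8(0xFF);

    /* Vector path: 8 pixels per iteration, like the assembly loop. */
    for (; i + 8 <= num_pix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* val[0]=R, val[1]=G, val[2]=B */
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2];  /* B */
        bgra.val[1] = rgb.val[1];  /* G */
        bgra.val[2] = rgb.val[0];  /* R */
        bgra.val[3] = alpha;       /* A */

        vst4_u8(dst + 4 * i, bgra);
    }
#endif

    /* Scalar path: leftover pixels, or every pixel when NEON is unavailable. */
    for (; i < num_pix; ++i) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xFF;
    }
}
```

Calling rgb_to_bgra(src, dst, numPix) converts every pixel, including any remainder that the numPix /= 8 loops above would skip.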
