Using ARM NEON intrinsics to add alpha and permute
Problem description
I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?
#include <arm_neon.h>

void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; //process 8 pixels at a time (any remainder is not converted)
    uint8x8_t alpha = vdup_n_u8 (0xff); //opaque alpha for all 8 lanes
    for (int i=0; i<numPix; i++)
    {
        uint8x8x3_t rgb = vld3_u8 (src); //de-interleave into R, G, B planes
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2]; //these lines are slow
        bgra.val[1] = rgb.val[1]; //these lines are slow
        bgra.val[2] = rgb.val[0]; //these lines are slow
        bgra.val[3] = alpha;
        vst4_u8(dst, bgra); //re-interleave as B, G, R, A
        src += 8*3;
        dst += 8*4;
    }
}
The code ARMCC generates isn't that fast either:
- It isn't using the most appropriate instructions.
- It mixes VFP instructions with NEON ones, which causes huge hiccups every time.
Try this:
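    mov     r2, r2, lsr #3          @ numPix /= 8
    vmov.i8 d3, #0xff               @ alpha = 0xff in all 8 lanes
loop:
    vld3.8  {d0-d2}, [r0]!          @ load 8 pixels: d0=R, d1=G, d2=B
    subs    r2, r2, #1
    vswp    d0, d2                  @ swap R and B
    vst4.8  {d0-d3}, [r1]!          @ store 8 pixels as B, G, R, A
    bgt     loop
    bx      lr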
My suggested code isn't fully optimized either, but further "real" optimizations would seriously harm readability, so I stop here.
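For readers who want to stay with intrinsics, as the question asked, here is a minimal C sketch of the same 8-pixel loop. It is an illustration, not the answerer's code: the function name rgb_to_bgra is hypothetical, the scalar loop for the leftover pixels is an addition (the original simply drops them), and the NEON path is guarded so that the file still builds on targets without NEON. Whether the three plane assignments compile down to a single vswp, or to no moves at all, depends on the compiler and its version, so it is worth checking the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original post): converts numPix packed
   RGB pixels in src to BGRA pixels in dst, with alpha forced to 0xFF.
   dst must have room for 4 * numPix bytes and must not overlap src. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t numPix)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t alpha = vdup_n_u8(0xFF);

    /* Main loop: 8 pixels per iteration, as in the assembly above. */
    for (; i + 8 <= numPix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* val[0]=R, val[1]=G, val[2]=B */
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];                /* B */
        bgra.val[1] = rgb.val[1];                /* G */
        bgra.val[2] = rgb.val[0];                /* R */
        bgra.val[3] = alpha;                     /* A */
        vst4_u8(dst + 4 * i, bgra);              /* interleave and store */
    }
#endif

    /* Scalar tail: the remaining pixels (all of them when NEON is unavailable). */
    for (; i < numPix; ++i) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xFF;
    }
}
```

Called as rgb_to_bgra(src, dst, width * height), this handles pixel counts that are not a multiple of 8. Indexing from the base pointers instead of advancing src and dst is only a style choice; the loads and stores are the same.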