Using ARM NEON intrinsics to add alpha and permute
Problem description
I'm developing an iOS app that needs to convert images from RGB to BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?
#include <arm_neon.h>

void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; // process 8 pixels at a time; any remainder is not converted
    uint8x8_t alpha = vdup_n_u8(0xff); // opaque alpha for all 8 lanes
    for (int i = 0; i < numPix; i++)
    {
        uint8x8x3_t rgb = vld3_u8(src); // de-interleave into R, G, B planes
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2]; // these lines are slow
        bgra.val[1] = rgb.val[1]; // these lines are slow
        bgra.val[2] = rgb.val[0]; // these lines are slow
        bgra.val[3] = alpha;
        vst4_u8(dst, bgra); // interleave and store as B, G, R, A
        src += 8*3;
        dst += 8*4;
    }
}
Recommended answer
The ARMCC disassembly isn't that fast either:
It isn't using the most appropriate instructions.
It mixes VFP instructions with NEON ones, which causes huge hiccups every time.
Try this:
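    mov     r2, r2, lsr #3          @ numPix / 8: eight pixels per iteration
    vmov.u8 d3, #0xff               @ alpha = 0xff in all eight lanes
loop:
    vld3.8  {d0-d2}, [r0]!          @ load 8 RGB pixels, de-interleaved
    subs    r2, r2, #1
    vswp    d0, d2                  @ swap the R and B planes
    vst4.8  {d0-d3}, [r1]!          @ store 8 BGRA pixels, interleaved
    bgt     loop
    bx      lr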
My suggested code isn't fully optimized either, but further "real" optimization would seriously hurt readability, so I'll stop here.
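Both versions above convert only whole groups of eight pixels, so a pixel count that is not a multiple of eight leaves the last few pixels unconverted. The following is a minimal sketch of one way to handle that, not the answerer's code: it keeps the intrinsics loop from the question and adds a scalar loop for the remainder. The function name `rgb_to_bgra` and the `size_t` count are assumptions made for illustration, and the NEON path is guarded so the same file also builds on non-ARM targets, where only the scalar loop runs. It makes no claim about being faster than the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name chosen for this sketch): converts num_pix
 * packed RGB pixels to BGRA with alpha fixed at 0xff. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t num_pix)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t alpha = vdup_n_u8(0xff);
    /* Vector path: eight pixels per iteration while at least eight remain. */
    for (; i + 8 <= num_pix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* R, G, B planes */
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];                /* B */
        bgra.val[1] = rgb.val[1];                /* G */
        bgra.val[2] = rgb.val[0];                /* R */
        bgra.val[3] = alpha;                     /* A */
        vst4_u8(dst + 4 * i, bgra);
    }
#endif

    /* Scalar path: the remaining pixels (all of them without NEON). */
    for (; i < num_pix; ++i) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

The scalar loop also serves as a reference to compare the vector output against when testing on a device.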