优化利用NEON Cortex-A8的颜色转换 [英] Optimizing Cortex-A8 color conversion using NEON
问题描述
我目前在做一个颜色转换程序,以转换为YUY2 NV12。
我有一个功能,这是相当快,但速度不如我所期望的,主要是由于缓存未命中。
I am currently doing a color conversion routine in order to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses.
void convert_hd(uint8_t *orig, uint8_t *result) {
uint32_t width = 1280;
uint32_t height = 720;
uint8_t *lineOdd = orig;
uint8_t *lineEven = orig + width*2;
uint8_t *resultYOdd = result;
uint8_t *resultYEven = result + width;
uint8_t *resultUV = result + height*width;
uint32_t totalLoop = height/2;
while (totalLoop-- > 0) {
uint32_t lineLoop = 1280/32; // Bytes length: width*2, read by iter 16Bytes
while(lineLoop-- > 0) {
__asm__ __volatile__(
"pld [%[lineOdd]] \n\t"
"vld4.8 {d0, d1, d2, d3}, [%[lineOdd],:128]! \n\t" // d0:Y d1:U0 d2:Y d3:V0
"pld [%[lineEven]] \n\t"
"vld4.8 {d4, d5, d6, d7}, [%[lineOdd],:128]! \n\t" // d4:Y d5:U1 d6:Y d7:V1
"vld4.8 {d8, d9, d10, d11}, [%[lineEven],:128]! \n\t" // d8:Y d9:U0' d10:Y d11:V0'
"vld4.8 {d12, d13, d14, d15}, [%[lineEven],:128]! \n\t" // d12:Y d13:U1' d14:Y d15:V1'
"vhadd.u8 d1, d1, d9 \n\t" // (U0+U0') / 2
"vhadd.u8 d3, d3, d11 \n\t" // (V0+V0') / 2
"vhadd.u8 d5, d5, d13 \n\t" // (U1+U1') / 2
"vhadd.u8 d7, d7, d15 \n\t" // (V1+V1') / 2
// Save
"vst2.8 {d0, d2}, [%[resultYOdd],:128]! \n\t"
"vst2.8 {d4, d6}, [%[resultYOdd],:128]! \n\t"
"vst2.8 {d8, d10}, [%[resultYEven],:128]! \n\t"
"vst2.8 {d12, d14}, [%[resultYEven],:128]! \n\t"
"vst2.8 {d1, d3}, [%[resultUV],:128]! \n\t"
"vst2.8 {d5, d7}, [%[resultUV],:128]! \n\t"
: [lineOdd]"+r"(lineOdd), [lineEven]"+r"(lineEven), [resultYOdd]"+r"(resultYOdd), [resultYEven]"+r"(resultYEven), [resultUV]"+r"(resultUV)
:
: "memory"
);
}
lineOdd += width*2;
lineEven += width*2;
resultYOdd += width;
resultYEven += width;
}
}
当我问什么的OProfile正在的时候,它说以下内容:
When I ask oprofile what is taking time, it says the following :
: 220c: add r2, r0, #2560 ;
: 2210: add r3, r1, #1280 ;
: 2214: add ip, r1, #921600 ;
: 2218: push {r4, lr}
: 221c: mov r4, #360 ;
6 0.1243 10 0.5787 4 0.4561 : 2220: mov lr, #40 ; 0x28
9 0.1864 5 0.2894 0 0 : 2224: pld [r0]
45 0.9321 7 0.4051 3 0.3421 : 2228: vld4.8 {d0-d3}, [r0 :128]!
51 1.0563 7 0.4051 1 0.1140 : 222c: pld [r2]
1 0.0207 1 0.0579 0 0 : 2230: vld4.8 {d4-d7}, [r0 :128]!
1360 28.1690 770 44.5602 463 52.7936 : 2234: vld4.8 {d8-d11}, [r2 :128]!
980 20.2983 329 19.0394 254 28.9624 : 2238: vld4.8 {d12-d15}, [r2 :128]!
: 223c: vhadd.u8 d1, d1, d9
1000 20.7125 170 9.8380 104 11.8586 : 2240: vhadd.u8 d3, d3, d11
: 2244: vhadd.u8 d5, d5, d13
5 0.1036 2 0.1157 2 0.2281 : 2248: vhadd.u8 d7, d7, d15
: 224c: vst2.8 {d0,d2}, [r1 :128]!
1125 23.3016 293 16.9560 15 1.7104 : 2250: vst2.8 {d4,d6}, [r1 :128]!
34 0.7042 41 2.3727 0 0 : 2254: vst2.8 {d8,d10}, [r3 :128]!
74 1.5327 8 0.4630 0 0 : 2258: vst2.8 {d12,d14}, [r3 :128]!
60 1.2428 39 2.2569 6 0.6842 : 225c: vst2.8 {d1,d3}, [ip :128]!
53 1.0978 24 1.3889 14 1.5964 : 2260: vst2.8 {d5,d7}, [ip :128]!
: 2264: subs lr, lr, #1
0 0 0 0 1 0.1140 : 2268: bne 2224 <convert_hd+0x18>
11 0.2278 14 0.8102 10 1.1403 : 226c: subs r4, r4, #1
: 2270: add r0, r0, #2560 ;
: 2274: add r2, r2, #2560 ;
2 0.0414 6 0.3472 0 0 : 2278: add r1, r1, #1280 ;
: 227c: add r3, r3, #1280 ;
2 0.0414 1 0.0579 0 0 : 2280: bne 2220 <convert_hd+0x14>
: 2284: pop {r4, pc}
- 首两列是周期数(绝对值和相对值)
- 的两个下的是L1高速缓存未命中(绝对值和相对值)
- 最后的是L2高速缓存未命中(绝对值和相对值)
任何帮助将是AP preciated,因为这是一个相当艰巨的任务,现在,找出思路和避免缓存未命中...
Any help would be appreciated, as this is a quite difficult task right now to find out ideas and avoid cache misses...
谢谢!
推荐答案
高速缓存行长度固定为八个字(32字节)。除了 PLD
您目前有,你需要 PLD [lineEven +缓存行]
。在未命中 vld4.8 {D8-D11}
,这是 lineEven
的下半年。 PLD
将只取一个高速缓存行。此外,你应该修改 PLD
位置。把一个在头,另外前 vhadd
,也许旁边内存目标。然后,您可以活跃ALU和存储器单元并行。
The cache line length is fixed at eight words (32 bytes). In addition to the pld
you currently have, you need pld[lineEven+cacheLine]
. The misses are vld4.8 {d8-d11}
, which is the 2nd half of lineEven
. pld
will only fetch a cache line. Also, you should alter the pld
position. Put one at the head and another before vhadd
, maybe with next memory target. You then have the ALU and memory units active in parallel.
此外,交错 vst2.8 {D0,D2}
与 vhadd
;它看起来像大多数数据是内存传输。在 vhadd
将阻止对数据的依赖性,如 D9
您可能/可能没有加载 PLD
,但没有安排好。
Also, interleave vst2.8 {d0, d2}
with the vhadd
; It looks like most data is a memory transfer. The vhadd
will block on data dependencies, like d9
which you may/may not have loading from pld
, but not scheduled well.
我不那么熟悉的 NEON ,但下面是试图遵循什么我说。
I am not that familiar with NEON, but the following is an attempt to follow what I said.
__asm__ __volatile__(
"pld [%[lineOdd], #32]\n\t" // 2nd part of odd.
"vld4.8 {d0, d1, d2, d3}, [%[lineOdd],:128]!\n\t"
"pld [%[lineEven], #32]\n\t" // 2nd part of even.
"vld4.8 {d8, d9, d10, d11}, [%[lineEven],:128]!\n\t"
"vld4.8 {d4, d5, d6, d7}, [%[lineOdd],:128]!\n\t"
"vld4.8 {d12, d13, d14, d15}, [%[lineEven],:128]!\n\t"
"vhadd.u8 d1, d1, d9\n\t"
// First in memory pipe, so write early.
"vst2.8 {d0, d2}, [%[resultYOdd],:128]!\n\t"
"vhadd.u8 d3, d3, d11\n\t"
"vst2.8 {d8, d10}, [%[resultYEven],:128]!\n\t"
"vhadd.u8 d5, d5, d13\n\t"
"vst2.8 {d4, d6}, [%[resultYOdd],:128]! \n\t"
"vhadd.u8 d7, d7, d15\n\t"
"vst2.8 {d12, d14}, [%[resultYEven],:128]! \n\t"
"pld [%[lineOdd]]\n\t" // 1st part of odd.
"vst2.8 {d1, d3}, [%[resultUV],:128]! \n\t"
"pld [%[lineEven]]\n\t" // 1st part of even.
"vst2.8 {d5, d7}, [%[resultUV],:128]! \n\t"
: [lineOdd]"+r"(lineOdd), [lineEven]"+r"(lineEven),
[resultYOdd]"+r"(resultYOdd), [resultYEven]"+r"(resultYEven),
[resultUV]"+r"(resultUV)
:
: "memory"
);
东西我可能是错在 NEON 操作的步伐;我不知道你怎么注册全是(64/128),让更多的 PLD
也许需要等,最好是用加交错的存储操作。特别是,一些 dX的
会在别人面前装,他们将准备使用。否则,你的ALU( vhadd
)将阻塞等待数据加载。
Things I may have wrong are the stride of the NEON operations; I have no idea how wide your registers are (64/128), so more PLD
maybe needed, etc. It is better to interleave the store operations with the additions. Especially, some dX
will be loaded before others and they will be ready to use. Otherwise, your ALU (vhadd
) will block waiting for the data to load.
您也不妨的主要的有 PLD [lineOdd]
和循环PLD [lineEven]
的东西开始之前。
You may also wish to prime the loop with pld[lineOdd]
and pld[lineEven]
before things begin.
这篇关于优化利用NEON Cortex-A8的颜色转换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!