Optimizing Cortex-A8 color conversion using NEON

Question

I am currently writing a color conversion routine to convert from YUY2 to NV12. I have a function which is quite fast, but not as fast as I would expect, mainly due to cache misses.

void convert_hd(uint8_t *orig, uint8_t *result) {
uint32_t width          = 1280;
uint32_t height         = 720;
uint8_t *lineOdd        = orig;
uint8_t *lineEven       = orig + width*2;
uint8_t *resultYOdd     = result;
uint8_t *resultYEven    = result + width;
uint8_t *resultUV       = result + height*width;
uint32_t totalLoop      = height/2;

while (totalLoop-- > 0) {
  uint32_t lineLoop = 1280/32; // Each line is width*2 bytes; one iteration reads 64 bytes (32 pixels) per line

  while(lineLoop-- > 0) {
    __asm__ __volatile__(
        "pld [%[lineOdd]]   \n\t"
        "vld4.8   {d0, d1, d2, d3}, [%[lineOdd],:128]!   \n\t" // d0:Y d1:U0 d2:Y d3:V0
        "pld [%[lineEven]]   \n\t"
        "vld4.8   {d4, d5, d6, d7}, [%[lineOdd],:128]!   \n\t" // d4:Y d5:U1 d6:Y d7:V1
        "vld4.8   {d8, d9, d10, d11}, [%[lineEven],:128]!  \n\t" // d8:Y d9:U0' d10:Y d11:V0'
        "vld4.8   {d12, d13, d14, d15}, [%[lineEven],:128]!  \n\t" // d12:Y d13:U1' d14:Y d15:V1'
        "vhadd.u8   d1, d1, d9    \n\t" // (U0+U0') / 2
        "vhadd.u8   d3, d3, d11    \n\t" // (V0+V0') / 2
        "vhadd.u8   d5, d5, d13    \n\t" // (U1+U1') / 2
        "vhadd.u8   d7, d7, d15    \n\t" // (V1+V1') / 2
        // Save
        "vst2.8 {d0, d2}, [%[resultYOdd],:128]!           \n\t"
        "vst2.8 {d4, d6}, [%[resultYOdd],:128]!           \n\t"
        "vst2.8 {d8, d10}, [%[resultYEven],:128]!          \n\t"
        "vst2.8 {d12, d14}, [%[resultYEven],:128]!          \n\t"
        "vst2.8 {d1, d3}, [%[resultUV],:128]!   \n\t"
        "vst2.8 {d5, d7}, [%[resultUV],:128]!   \n\t"
        : [lineOdd]"+r"(lineOdd), [lineEven]"+r"(lineEven), [resultYOdd]"+r"(resultYOdd), [resultYEven]"+r"(resultYEven), [resultUV]"+r"(resultUV)
        :
        : "memory"
    );
  }
  lineOdd += width*2;
  lineEven += width*2;
  resultYOdd += width;
  resultYEven += width;
}
}
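When tuning a routine like this, it can help to have a plain scalar version to compare output against. The following is a minimal sketch, not part of the original post: the function name `yuy2_to_nv12_ref` and its parameters are hypothetical, and it assumes an even width and height. It uses a truncating average for chroma so that it matches what `vhadd.u8` computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar YUY2 -> NV12 reference (hypothetical helper, for checking only).
 * src: packed YUY2 (Y0 U Y1 V ...), 2 bytes per pixel.
 * dst: NV12, a width*height Y plane followed by an interleaved UV plane. */
void yuy2_to_nv12_ref(const uint8_t *src, uint8_t *dst,
                      size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    uint8_t *y_plane  = dst;
    uint8_t *uv_plane = dst + width * height;

    for (size_t row = 0; row < height; row += 2) {
        const uint8_t *top = src + row * width * 2;
        const uint8_t *bot = top + width * 2;
        uint8_t *y_top = y_plane + row * width;
        uint8_t *y_bot = y_top + width;
        uint8_t *uv    = uv_plane + (row / 2) * width;

        for (size_t x = 0; x < width; x += 2) {
            /* Each 4-byte group holds two pixels: Y0 U Y1 V. */
            const uint8_t *t = top + x * 2;
            const uint8_t *b = bot + x * 2;

            y_top[x]     = t[0];
            y_top[x + 1] = t[2];
            y_bot[x]     = b[0];
            y_bot[x + 1] = b[2];

            /* Truncating average of the two rows, like vhadd.u8. */
            uv[x]     = (uint8_t)((t[1] + b[1]) >> 1);
            uv[x + 1] = (uint8_t)((t[3] + b[3]) >> 1);
        }
    }
}
```

Running both versions on the same frame and comparing the output buffers byte for byte is a cheap way to confirm that a rescheduled NEON loop still produces the same result.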

When I ask oprofile where the time is going, it reports the following:

                                           :    220c:   add r2, r0, #2560   ;
                                           :    2210:   add r3, r1, #1280   ;
                                           :    2214:   add ip, r1, #921600 ;
                                           :    2218:   push    {r4, lr}
                                           :    221c:   mov r4, #360    ;
 6  0.1243    10  0.5787     4  0.4561     :    2220:   mov lr, #40 ; 0x28
 9  0.1864     5  0.2894     0       0     :    2224:   pld [r0]
45  0.9321     7  0.4051     3  0.3421     :    2228:   vld4.8  {d0-d3}, [r0 :128]!
51  1.0563     7  0.4051     1  0.1140     :    222c:   pld [r2]
 1  0.0207     1  0.0579     0       0     :    2230:   vld4.8  {d4-d7}, [r0 :128]!
1360 28.1690   770 44.5602   463 52.7936     :    2234: vld4.8  {d8-d11}, [r2 :128]!
 980 20.2983   329 19.0394   254 28.9624     :    2238: vld4.8  {d12-d15}, [r2 :128]!
                                             :    223c: vhadd.u8    d1, d1, d9
1000 20.7125   170  9.8380   104 11.8586     :    2240: vhadd.u8    d3, d3, d11
                                             :    2244: vhadd.u8    d5, d5, d13
   5  0.1036     2  0.1157     2  0.2281     :    2248: vhadd.u8    d7, d7, d15
                                             :    224c: vst2.8  {d0,d2}, [r1 :128]!
1125 23.3016   293 16.9560    15  1.7104     :    2250: vst2.8  {d4,d6}, [r1 :128]!
  34  0.7042    41  2.3727     0       0     :    2254: vst2.8  {d8,d10}, [r3 :128]!
  74  1.5327     8  0.4630     0       0     :    2258: vst2.8  {d12,d14}, [r3 :128]!
  60  1.2428    39  2.2569     6  0.6842     :    225c: vst2.8  {d1,d3}, [ip :128]!
  53  1.0978    24  1.3889    14  1.5964     :    2260: vst2.8  {d5,d7}, [ip :128]!
                                             :    2264: subs    lr, lr, #1
   0       0     0       0     1  0.1140     :    2268: bne 2224 <convert_hd+0x18>
  11  0.2278    14  0.8102    10  1.1403     :    226c: subs    r4, r4, #1
                                             :    2270: add r0, r0, #2560   ;
                                             :    2274: add r2, r2, #2560   ;
   2  0.0414     6  0.3472     0       0     :    2278: add r1, r1, #1280   ;
                                             :    227c: add r3, r3, #1280   ;
   2  0.0414     1  0.0579     0       0     :    2280: bne 2220 <convert_hd+0x14>
                                             :    2284: pop {r4, pc}

  • The first two columns are cycle counts (absolute and relative)
  • The next two are L1 cache misses (absolute and relative)
  • The last two are L2 cache misses (absolute and relative)

    Any help would be appreciated, as I am finding it quite difficult to come up with ideas for avoiding these cache misses.

    Thanks!

    Recommended answer

    The cache line length is fixed at eight words (32 bytes). In addition to the pld you currently have, you need a pld [lineEven + cacheLine]. The misses are on vld4.8 {d8-d11}, which is the 2nd half of lineEven. A pld will only fetch one cache line. Also, you should change where the pld instructions sit. Put one at the head and another before the vhadd, perhaps targeting the next memory location. You then have the ALU and memory units active in parallel.
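As a rough illustration of why one pld per row may not be enough: each loop iteration reads 64 bytes from each row (two vld4.8 loads of 32 bytes each), so with a 32-byte line that is two cache lines per row. The sketch below is not from the original post. `CACHE_LINE_BYTES`, `cache_lines_for` and `prefetch_span` are hypothetical names, the 32-byte value is simply taken from the answer and should be checked against the reference manual for the actual core, and `__builtin_prefetch` is the GCC/Clang builtin, which on ARM targets is typically emitted as a pld.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes, taken from the answer's text.
 * Verify it for the actual core before relying on it. */
#define CACHE_LINE_BYTES 32u

/* Number of cache lines covered by `bytes` bytes that start on a
 * cache-line boundary. */
size_t cache_lines_for(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Issue one read-prefetch hint per cache line in [p, p + bytes).
 * Returns how many hints were issued. */
size_t prefetch_span(const uint8_t *p, size_t bytes)
{
    size_t issued = 0;
    for (size_t off = 0; off < bytes; off += CACHE_LINE_BYTES) {
        __builtin_prefetch(p + off, 0 /* read */, 3 /* high locality */);
        issued++;
    }
    return issued;
}
```

With these assumptions, `cache_lines_for(64)` gives 2, which is why the answer suggests a second pld at an offset of one cache line for each input row.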

    Also, interleave vst2.8 {d0, d2} with the vhadd instructions; it looks like most of the work is memory transfer. The vhadd will block on data dependencies such as d9, which may or may not already be loading thanks to the pld, but is not scheduled well.

    I am not that familiar with NEON, but the following is an attempt to follow what I said.

    __asm__ __volatile__(
        "pld [%[lineOdd], #32]\n\t" // 2nd part of odd.
        "vld4.8   {d0, d1, d2, d3}, [%[lineOdd],:128]!\n\t"
        "pld [%[lineEven], #32]\n\t" // 2nd part of even.
        "vld4.8   {d8, d9, d10, d11}, [%[lineEven],:128]!\n\t"
        "vld4.8   {d4, d5, d6, d7}, [%[lineOdd],:128]!\n\t"
        "vld4.8   {d12, d13, d14, d15}, [%[lineEven],:128]!\n\t" 
        "vhadd.u8   d1, d1, d9\n\t"
        // First in memory pipe, so write early.
        "vst2.8 {d0, d2}, [%[resultYOdd],:128]!\n\t"  
        "vhadd.u8   d3, d3, d11\n\t"
        "vst2.8 {d8, d10}, [%[resultYEven],:128]!\n\t"
        "vhadd.u8   d5, d5, d13\n\t"
        "vst2.8 {d4, d6}, [%[resultYOdd],:128]!           \n\t"
        "vhadd.u8   d7, d7, d15\n\t"
        "vst2.8 {d12, d14}, [%[resultYEven],:128]!          \n\t"
        "pld [%[lineOdd]]\n\t"   // 1st part of odd.
        "vst2.8 {d1, d3}, [%[resultUV],:128]!   \n\t"
        "pld [%[lineEven]]\n\t"  // 1st part of even.
        "vst2.8 {d5, d7}, [%[resultUV],:128]!   \n\t"
        : [lineOdd]"+r"(lineOdd), [lineEven]"+r"(lineEven),
          [resultYOdd]"+r"(resultYOdd), [resultYEven]"+r"(resultYEven),
          [resultUV]"+r"(resultUV)
        :
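        : "memory",
          // NEON registers overwritten by the asm (d8-d15 are callee-saved)
          "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
          "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15"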
    );
    

    One thing I may have wrong is the stride of the NEON operations; I have no idea how wide your registers are (64/128 bits), so more pld instructions may be needed, etc. It is better to interleave the store operations with the additions. In particular, some dX registers will be loaded before others and will be ready to use. Otherwise, your ALU (vhadd) will block waiting for the data to load.

    You may also wish to prime the loop with pld [lineOdd] and pld [lineEven] before the loop begins.
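The same ideas (priming the loop, prefetching ahead, and letting loads, stores and vhadd overlap) can also be tried with NEON intrinsics, which leave instruction scheduling and register allocation to the compiler. The following is only a sketch under stated assumptions, not the author's method. The function name `yuy2_row_pair_to_nv12` is hypothetical, `width` is assumed to be a multiple of 16 pixels, the prefetch offset of 32 bytes follows the answer's line size, and the scalar branch exists only so the code also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Convert one pair of YUY2 rows into two NV12 luma rows and one interleaved
 * UV row. Hypothetical helper; width is in pixels, a multiple of 16. */
void yuy2_row_pair_to_nv12(const uint8_t *top, const uint8_t *bot,
                           uint8_t *y_top, uint8_t *y_bot, uint8_t *uv,
                           size_t width)
{
    /* Prime the loop: hint the start of both rows before the first load. */
    __builtin_prefetch(top);
    __builtin_prefetch(bot);

    for (size_t x = 0; x < width; x += 16) {
        /* Hint the data for the next iteration (prefetch is only a hint). */
        __builtin_prefetch(top + 32);
        __builtin_prefetch(bot + 32);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* De-interleaving load: val[0]=Y0 val[1]=U val[2]=Y1 val[3]=V. */
        uint8x8x4_t t = vld4_u8(top);
        uint8x8x4_t b = vld4_u8(bot);
        uint8x8x2_t yt, yb, c;

        yt.val[0] = t.val[0];
        yt.val[1] = t.val[2];
        yb.val[0] = b.val[0];
        yb.val[1] = b.val[2];
        c.val[0] = vhadd_u8(t.val[1], b.val[1]); /* (U + U') >> 1 */
        c.val[1] = vhadd_u8(t.val[3], b.val[3]); /* (V + V') >> 1 */

        vst2_u8(y_top, yt); /* re-interleave Y0, Y1 */
        vst2_u8(y_bot, yb);
        vst2_u8(uv, c);     /* interleave U, V */
#else
        /* Scalar fallback with the same truncating average. */
        for (size_t i = 0; i < 8; i++) {
            y_top[2 * i]     = top[4 * i];
            y_top[2 * i + 1] = top[4 * i + 2];
            y_bot[2 * i]     = bot[4 * i];
            y_bot[2 * i + 1] = bot[4 * i + 2];
            uv[2 * i]     = (uint8_t)((top[4 * i + 1] + bot[4 * i + 1]) >> 1);
            uv[2 * i + 1] = (uint8_t)((top[4 * i + 3] + bot[4 * i + 3]) >> 1);
        }
#endif
        top += 32;
        bot += 32;
        y_top += 16;
        y_bot += 16;
        uv += 16;
    }
}
```

Whether the compiler's schedule beats hand-written assembly on a given core has to be measured, for example with the same oprofile run used in the question.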
