Optimizing Cortex-A8 color conversion using NEON

Views: 388 | Published: 2016/5/29 14:48:40 | Tags: assembly, arm, neon, cpu-cache, cortex-a8

Question

I am currently writing a color conversion routine that converts YUY2 to NV12. The function is fairly fast, but not as fast as I expected, mainly because of cache misses.

#include <stdint.h>

void convert_hd(uint8_t *orig, uint8_t *result) {
    uint32_t width          = 1280;
    uint32_t height         = 720;
    uint8_t *lineOdd        = orig;
    uint8_t *lineEven       = orig + width*2;
    uint8_t *resultYOdd     = result;
    uint8_t *resultYEven    = result + width;
    uint8_t *resultUV       = result + height*width;
    uint32_t totalLoop      = height/2;

    while (totalLoop-- > 0) {
        // Each YUY2 row is width*2 bytes; one iteration consumes 64 bytes (32 pixels) per row.
        uint32_t lineLoop = width/32;

        while (lineLoop-- > 0) {
            __asm__ __volatile__(
                "pld [%[lineOdd]]   \n\t"
                "vld4.8   {d0, d1, d2, d3}, [%[lineOdd],:128]!   \n\t" // d0:Y d1:U0 d2:Y d3:V0
                "pld [%[lineEven]]   \n\t"
                "vld4.8   {d4, d5, d6, d7}, [%[lineOdd],:128]!   \n\t" // d4:Y d5:U1 d6:Y d7:V1
                "vld4.8   {d8, d9, d10, d11}, [%[lineEven],:128]!  \n\t" // d8:Y d9:U0' d10:Y d11:V0'
                "vld4.8   {d12, d13, d14, d15}, [%[lineEven],:128]!  \n\t" // d12:Y d13:U1' d14:Y d15:V1'
                "vhadd.u8   d1, d1, d9    \n\t" // (U0+U0') / 2
                "vhadd.u8   d3, d3, d11    \n\t" // (V0+V0') / 2
                "vhadd.u8   d5, d5, d13    \n\t" // (U1+U1') / 2
                "vhadd.u8   d7, d7, d15    \n\t" // (V1+V1') / 2
                // Save
                "vst2.8 {d0, d2}, [%[resultYOdd],:128]!           \n\t"
                "vst2.8 {d4, d6}, [%[resultYOdd],:128]!           \n\t"
                "vst2.8 {d8, d10}, [%[resultYEven],:128]!          \n\t"
                "vst2.8 {d12, d14}, [%[resultYEven],:128]!          \n\t"
                "vst2.8 {d1, d3}, [%[resultUV],:128]!   \n\t"
                "vst2.8 {d5, d7}, [%[resultUV],:128]!   \n\t"
                : [lineOdd]"+r"(lineOdd), [lineEven]"+r"(lineEven), [resultYOdd]"+r"(resultYOdd), [resultYEven]"+r"(resultYEven), [resultUV]"+r"(resultUV)
                :
                // The asm overwrites d0-d15, so the compiler must be told about them.
                : "memory", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
                  "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15"
            );
        }
        // Each pointer already advanced over its own row; skip the row handled by its partner.
        lineOdd += width*2;
        lineEven += width*2;
        resultYOdd += width;
        resultYEven += width;
    }
}
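
When reordering loads, stores, and prefetches in a routine like this, it helps to have a slow but obvious reference to compare the output against. The following is a minimal scalar sketch, not part of the original question: the name yuy2_to_nv12_ref and its width/height parameters are hypothetical, and it assumes the same layout as the assembly above (YUY2 bytes in the order Y0 U Y1 V, and chroma averaged between each pair of rows with the truncating (a + b) >> 1 that vhadd.u8 performs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for YUY2 -> NV12 (not from the original post).
 * src: packed YUY2, 2 bytes per pixel, byte order Y0 U Y1 V.
 * dst: NV12, i.e. a width*height Y plane followed by an interleaved UV
 *      plane of width*height/2 bytes. */
static void yuy2_to_nv12_ref(const uint8_t *src, uint8_t *dst,
                             size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    uint8_t *dst_y  = dst;
    uint8_t *dst_uv = dst + width * height;

    for (size_t row = 0; row < height; row += 2) {
        const uint8_t *top = src + row * width * 2;   /* first row of the pair  */
        const uint8_t *bot = top + width * 2;         /* second row of the pair */

        for (size_t x = 0; x < width; x += 2) {
            const uint8_t *t = top + x * 2;           /* Y0 U Y1 V */
            const uint8_t *b = bot + x * 2;           /* Y0 U Y1 V */

            /* Luma is copied unchanged into the Y plane. */
            dst_y[row * width + x]           = t[0];
            dst_y[row * width + x + 1]       = t[2];
            dst_y[(row + 1) * width + x]     = b[0];
            dst_y[(row + 1) * width + x + 1] = b[2];

            /* Chroma: truncating average of the two rows, like vhadd.u8. */
            dst_uv[(row / 2) * width + x]     = (uint8_t)((t[1] + b[1]) >> 1);
            dst_uv[(row / 2) * width + x + 1] = (uint8_t)((t[3] + b[3]) >> 1);
        }
    }
}
```

Running both routines on the same 1280x720 frame and comparing the two output buffers with memcmp is one way to check that a rescheduled NEON loop still produces the same bytes.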

When I ask oprofile what is taking the time, it reports the following:

                                           :    220c:   add r2, r0, #2560   ;
                                           :    2210:   add r3, r1, #1280   ;
                                           :    2214:   add ip, r1, #921600 ;
                                           :    2218:   push    {r4, lr}
                                           :    221c:   mov r4, #360    ;
 6  0.1243    10  0.5787     4  0.4561     :    2220:   mov lr, #40 ; 0x28
 9  0.1864     5  0.2894     0       0     :    2224:   pld [r0]
45  0.9321     7  0.4051     3  0.3421     :    2228:   vld4.8  {d0-d3}, [r0 :128]!
51  1.0563     7  0.4051     1  0.1140     :    222c:   pld [r2]
 1  0.0207     1  0.0579     0       0     :    2230:   vld4.8  {d4-d7}, [r0 :128]!
1360 28.1690   770 44.5602   463 52.7936     :    2234: vld4.8  {d8-d11}, [r2 :128]!
 980 20.2983   329 19.0394   254 28.9624     :    2238: vld4.8  {d12-d15}, [r2 :128]!
                                             :    223c: vhadd.u8    d1, d1, d9
1000 20.7125   170  9.8380   104 11.8586     :    2240: vhadd.u8    d3, d3, d11
                                             :    2244: vhadd.u8    d5, d5, d13
   5  0.1036     2  0.1157     2  0.2281     :    2248: vhadd.u8    d7, d7, d15
                                             :    224c: vst2.8  {d0,d2}, [r1 :128]!
1125 23.3016   293 16.9560    15  1.7104     :    2250: vst2.8  {d4,d6}, [r1 :128]!
  34  0.7042    41  2.3727     0       0     :    2254: vst2.8  {d8,d10}, [r3 :128]!
  74  1.5327     8  0.4630     0       0     :    2258: vst2.8  {d12,d14}, [r3 :128]!
  60  1.2428    39  2.2569     6  0.6842     :    225c: vst2.8  {d1,d3}, [ip :128]!
  53  1.0978    24  1.3889    14  1.5964     :    2260: vst2.8  {d5,d7}, [ip :128]!
                                             :    2264: subs    lr, lr, #1
   0       0     0       0     1  0.1140     :    2268: bne 2224 <convert_hd+0x18>
  11  0.2278    14  0.8102    10  1.1403     :    226c: subs    r4, r4, #1
                                             :    2270: add r0, r0, #2560   ;
                                             :    2274: add r2, r2, #2560   ;
   2  0.0414     6  0.3472     0       0     :    2278: add r1, r1, #1280   ;
                                             :    227c: add r3, r3, #1280   ;
   2  0.0414     1  0.0579     0       0     :    2280: bne 2220 <convert_hd+0x14>
                                             :    2284: pop {r4, pc}

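The first two columns are cycle counts (absolute and relative).

The next two are L1 cache misses (absolute and relative).

The last two are L2 cache misses (absolute and relative).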

Any help would be appreciated; finding ways to avoid these cache misses is proving quite difficult.

Thanks!

Recommended answer

The cache line length is fixed at eight words (32 bytes). In addition to the pld you currently have, you need pld [lineEven + cacheLine]. The misses are on vld4.8 {d8-d11}, which is the 2nd half of lineEven. A pld only fetches one cache line. You should also move the pld instructions: put one at the head and another before the vhadd, perhaps targeting the next memory location. That keeps the ALU and memory units active in parallel.
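
The arithmetic behind "one pld is not enough" can be sketched as follows. This is an illustration only: cache_lines_touched is a hypothetical helper, the 32-byte line size is the figure quoted in this answer and is passed as a parameter so it can be checked against the core's documentation, and the 64 bytes per row per iteration comes from the two 32-byte vld4.8 loads in the question's loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of cache lines covered by the byte range
 * [addr, addr + len), for a given line size in bytes. */
static size_t cache_lines_touched(uintptr_t addr, size_t len, size_t line_size)
{
    assert(len > 0 && line_size > 0);
    uintptr_t first = addr / line_size;              /* index of first line */
    uintptr_t last  = (addr + len - 1) / line_size;  /* index of last line  */
    return (size_t)(last - first + 1);
}
```

With a line size of 32, an aligned 64-byte read per row covers two lines, so a single pld per row leaves the second line to be fetched on demand. The same helper shows that a 32-byte read starting 16 bytes into a line also straddles two lines.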

Also, interleave the vst2.8 {d0, d2} stores with the vhadd instructions; most of the work here appears to be memory transfer. A vhadd will stall on its data dependencies, such as d9, which may or may not have been brought in by a pld but is not scheduled well.

I am not that familiar with NEON, but the following is an attempt to follow what I said.

__asm__ __volatile__(
    "pld [%[lineOdd], #32]\n\t" // 2nd part of odd.
    "vld4.8   {d0, d1, d2, d3}, [%[lineOdd],:128]!\n\t"
    "pld [%[lineEven], #32]\n\t" // 2nd part of even.
    "vld4.8   {d8, d9, d10, d11}, [%[lineEven],:128]!\n\t"
    "vld4.8   {d4, d5, d6, d7}, [%[lineOdd],:128]!\n\t"
    "vld4.8   {d12, d13, d14, d15}, [%[lineEven],:128]!\n\t" 
    "vhadd.u8   d1, d1, d9\n\t"
    // First in memory pipe, so write early.
    "vst2.8 {d0, d2}, [%[resultYOdd],:128]!\n\t"  
    "vhadd.u8   d3, d3, d11\n\t"
    "vst2.8 {d8, d10}, [%[resultYEven],:128]!\n\t"
    "vhadd.u8   d5, d5, d13\n\t"
    "vst2.8 {d4, d6}, [%[resultYOdd],:128]!           \n\t"
    "vhadd.u8   d7, d7, d15\n\t"
    "vst2.8 {d12, d14}, [%[resultYEven],:128]!          \n\t"
    "pld [%[lineOdd]]\n\t"   // 1st part of odd.
    "vst2.8 {d1, d3}, [%[resultUV],:128]!   \n\t"
    "pld [%[lineEven]]\n\t"  // 1st part of even.
    "vst2.8 {d5, d7}, [%[resultUV],:128]!   \n\t"
    : [lineOdd]"+r"(lineOdd), [lineEven]"+r"(lineEven),
      [resultYOdd]"+r"(resultYOdd), [resultYEven]"+r"(resultYEven),
      [resultUV]"+r"(resultUV)
    :
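    // The asm overwrites d0-d15, so the compiler must be told about them.
    : "memory", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
      "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15"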
);

Things I may have wrong include the stride of the NEON operations; I do not know how wide your registers are (64 or 128 bits), so more pld instructions may be needed. It is better to interleave the store operations with the additions. In particular, some dX registers will be loaded before others and will be ready to use sooner. Otherwise, the ALU (vhadd) will stall waiting for the data to load.

You may also wish to prime the loop with pld [lineOdd] and pld [lineEven] before it begins.
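
The priming idea can be sketched in portable C with the GCC/Clang builtin __builtin_prefetch, which on ARM targets is typically lowered to pld. This is a hedged illustration rather than a drop-in replacement for the assembly: halving_add_rows and BLOCK are hypothetical names, the 64-byte block mirrors the bytes consumed per row per iteration in the question, and the best prefetch distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 64 };  /* assumed bytes handled per row per iteration */

/* Hypothetical helper: byte-wise truncating average of two rows (what
 * vhadd.u8 computes), with the loop primed and prefetching one block ahead. */
static void halving_add_rows(const uint8_t *top, const uint8_t *bot,
                             uint8_t *out, size_t len)
{
    assert(top && bot && out);
    if (len == 0)
        return;

    /* Prime the loop: request the first line of each row before any load. */
    __builtin_prefetch(top);
    __builtin_prefetch(bot);

    for (size_t i = 0; i < len; i += BLOCK) {
        /* Request the next block while this one is being processed.
         * The guard keeps the pointer arithmetic inside the arrays. */
        if (i + BLOCK < len) {
            __builtin_prefetch(top + i + BLOCK);
            __builtin_prefetch(bot + i + BLOCK);
        }

        size_t end = (len - i < BLOCK) ? len : i + BLOCK;
        for (size_t j = i; j < end; ++j)
            out[j] = (uint8_t)((top[j] + bot[j]) >> 1);
    }
}
```

The prefetches are only hints, so the result is the same with or without them; only the timing changes.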
