Why ARM NEON not faster than plain C++?


Problem description

Here is the C++ code:

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}

Here is a NEON version:

void neon_assm_tst_add( unsigned* x, unsigned* y )
{
    register unsigned i = ARR_SIZE_TEST >> 2;

    __asm__ __volatile__
    (
        ".loop1:                            \n\t"

        "vld1.32   {q0}, [%[x]]             \n\t"
        "vld1.32   {q1}, [%[y]]!            \n\t"

        "vadd.i32  q0 ,q0, q1               \n\t"
        "vst1.32   {q0}, [%[x]]!            \n\t"

        "subs     %[i], %[i], $1            \n\t"
        "bne      .loop1                    \n\t"

        : [x]"+r"(x), [y]"+r"(y), [i]"+r"(i)
        :
        : "memory"
    );
}
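
The benchmark below also calls a neon_tst_add function (an intrinsics variant) that is not reproduced here. As a rough illustration only, here is a minimal sketch of what such a variant might look like with standard NEON intrinsics; this is an assumption, not the author's original code:

#include <arm_neon.h>

// Hypothetical intrinsics variant (sketch, not the original neon_tst_add):
// adds 4 unsigned 32-bit lanes per iteration, like the inline-asm loop above.
void neon_tst_add( unsigned* x, unsigned* y )
{
    for ( int i = 0; i < ARR_SIZE_TEST; i += 4 )
    {
        uint32x4_t vx = vld1q_u32( x + i );          // load x[i..i+3]
        uint32x4_t vy = vld1q_u32( y + i );          // load y[i..i+3]
        vst1q_u32( x + i, vaddq_u32( vx, vy ) );     // x[i..i+3] += y[i..i+3]
    }
}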

The test function:

void bench_simple_types_test( )
{
    unsigned* a = new unsigned [ ARR_SIZE_TEST ];
    unsigned* b = new unsigned [ ARR_SIZE_TEST ];

    neon_tst_add( a, b );
    neon_assm_tst_add( a, b );
}
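
Timings like those reported below could be collected with a simple wall-clock harness around each call; a minimal sketch, assuming C++11 std::chrono is available (the author's actual measurement code is not shown):

#include <chrono>
#include <cstdio>

// Hypothetical helper (sketch): times one call of fn( a, b ) in milliseconds.
template <typename Fn>
long long bench_ms( Fn fn, unsigned* a, unsigned* b )
{
    auto t0 = std::chrono::steady_clock::now();
    fn( a, b );
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>( t1 - t0 ).count();
}

// Example: printf( "add, unsigned, C++ : %lld ms\n", bench_ms( cpp_tst_add, a, b ) );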

I have tested both variants and here is the report:

add, unsigned, C++       : 176 ms
add, unsigned, neon asm  : 185 ms // SLOW!!!

I also tested other types:

add, float,    C++       : 571 ms
add, float,    neon asm  : 184 ms // FASTER X3!

THE QUESTION: Why is NEON slower with 32-bit integer types?

I used the latest version of GCC from the Android NDK. NEON optimization flags were turned on. Here is the disassembled C++ version:

                 MOVS            R3, #0
                 PUSH            {R4}

 loc_8
                 LDR             R4, [R0,R3]
                 LDR             R2, [R1,R3]
                 ADDS            R2, R4, R2
                 STR             R2, [R0,R3]
                 ADDS            R3, #4
                 CMP.W           R3, #0x2000000
                 BNE             loc_8
                 POP             {R4}
                 BX              LR

Here is the disassembled NEON version:

                 MOV.W           R3, #0x200000
.loop1
                 VLD1.32         {D0-D1}, [R0]
                 VLD1.32         {D2-D3}, [R1]!
                 VADD.I32        Q0, Q0, Q1
                 VST1.32         {D0-D1}, [R0]!
                 SUBS            R3, #1
                 BNE             .loop1
                 BX              LR

Here are all the benchmark results:

add, char,     C++       : 83  ms
add, char,     neon asm  : 46  ms FASTER x2

add, short,    C++       : 114 ms
add, short,    neon asm  : 92  ms FASTER x1.25

add, unsigned, C++       : 176 ms
add, unsigned, neon asm  : 184 ms SLOWER!!!

add, float,    C++       : 571 ms
add, float,    neon asm  : 184 ms FASTER x3

add, double,   C++       : 533 ms
add, double,   neon asm  : 420 ms FASTER x1.25

THE QUESTION: Why is NEON slower with 32-bit integer types?

Recommended answer

The NEON pipeline on the Cortex-A8 is in-order and has limited hit-under-miss (no renaming), so you are limited by memory latency (as you're using more than the L1/L2 cache size). Your code has immediate dependencies on the values loaded from memory, so it will stall constantly waiting for memory. This explains why the NEON code is slightly (by a tiny amount) slower than the non-NEON code.

You need to unroll the assembly loops and increase the distance between load and use, e.g.:

vld1.32   {q0}, [%[x]]!
vld1.32   {q1}, [%[y]]!
vld1.32   {q2}, [%[x]]!
vld1.32   {q3}, [%[y]]!
vadd.i32  q0, q0, q1
vadd.i32  q2, q2, q3
...

There are plenty of NEON registers, so you can unroll it a lot. Integer code will suffer from the same issue, though to a lesser extent, because the A8's integer pipeline has better hit-under-miss handling instead of stalling. The bottleneck is going to be memory bandwidth/latency for benchmarks this large compared to the L1/L2 cache. You might also want to run the benchmark at smaller sizes (4 KB..256 KB) to see the effect when the data fits entirely in L1 and/or L2 cache.
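
To make that concrete, the original inline-assembly loop could be unrolled four times with all loads issued before the adds, for example as in the sketch below. The function name, the separate store pointer, and the explicit clobber list are additions for illustration; this is a sketch of the technique, not the answerer's code:

void neon_assm_tst_add_unrolled( unsigned* x, unsigned* y )
{
    unsigned* xs = x;                          /* separate store pointer for x */
    register unsigned i = ARR_SIZE_TEST >> 4;  /* 16 elements per iteration */

    __asm__ __volatile__
    (
        ".loop_u:                           \n\t"

        /* issue all loads up front so the adds are not waiting on them */
        "vld1.32   {q0}, [%[x]]!            \n\t"
        "vld1.32   {q1}, [%[y]]!            \n\t"
        "vld1.32   {q2}, [%[x]]!            \n\t"
        "vld1.32   {q3}, [%[y]]!            \n\t"
        "vld1.32   {q4}, [%[x]]!            \n\t"
        "vld1.32   {q5}, [%[y]]!            \n\t"
        "vld1.32   {q6}, [%[x]]!            \n\t"
        "vld1.32   {q7}, [%[y]]!            \n\t"

        "vadd.i32  q0, q0, q1               \n\t"
        "vadd.i32  q2, q2, q3               \n\t"
        "vadd.i32  q4, q4, q5               \n\t"
        "vadd.i32  q6, q6, q7               \n\t"

        "vst1.32   {q0}, [%[xs]]!           \n\t"
        "vst1.32   {q2}, [%[xs]]!           \n\t"
        "vst1.32   {q4}, [%[xs]]!           \n\t"
        "vst1.32   {q6}, [%[xs]]!           \n\t"

        "subs      %[i], %[i], #1           \n\t"
        "bne       .loop_u                  \n\t"

        : [x]"+r"(x), [xs]"+r"(xs), [y]"+r"(y), [i]"+r"(i)
        :
        : "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "memory", "cc"
    );
}

Using a second pointer for the stores lets the loads from x keep post-incrementing freely, which is what allows them to be hoisted ahead of the vadd instructions in the first place.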
