Why is ARM NEON not faster than plain C++?
Question
Here is the C++ code:
#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}
Here is the NEON version:
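void neon_assm_tst_add( unsigned* x, unsigned* y )
{
    register unsigned i = ARR_SIZE_TEST >> 2;   // four 32-bit lanes per iteration
    __asm__ __volatile__
    (
        ".loop1:                        \n\t"
        "vld1.32   {q0}, [%[x]]         \n\t"   // load 4 values from x (no post-increment)
        "vld1.32   {q1}, [%[y]]!        \n\t"   // load 4 values from y, advance y
        "vadd.i32  q0 ,q0, q1           \n\t"
        "vst1.32   {q0}, [%[x]]!        \n\t"   // store the sums back to x, advance x
        "subs      %[i], %[i], $1       \n\t"
        "bne       .loop1               \n\t"
        : [x]"+r"(x), [y]"+r"(y), [i]"+r"(i)
        :
        : "memory", "cc", "q0", "q1"            // subs sets the flags; q0 and q1 are overwritten
    );
}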
Test function:
void bench_simple_types_test( )
{
    unsigned* a = new unsigned [ ARR_SIZE_TEST ];
    unsigned* b = new unsigned [ ARR_SIZE_TEST ];
    cpp_tst_add( a, b );
    neon_assm_tst_add( a, b );
    delete [] a;
    delete [] b;
}
I have tested both variants, and here is the report:
add, unsigned, C++ : 176 ms
add, unsigned, neon asm : 185 ms // SLOW!!!
I also tested other types:
add, float, C++ : 571 ms
add, float, neon asm : 184 ms // FASTER X3!
THE QUESTION: Why is NEON slower with 32-bit integer types?
I used the latest version of GCC for the Android NDK. NEON optimization flags were turned on. Here is the disassembled C++ version:
MOVS R3, #0
PUSH {R4}
loc_8
LDR R4, [R0,R3]
LDR R2, [R1,R3]
ADDS R2, R4, R2
STR R2, [R0,R3]
ADDS R3, #4
CMP.W R3, #0x2000000
BNE loc_8
POP {R4}
BX LR
Here is the disassembled NEON version:
MOV.W R3, #0x200000
.loop1
VLD1.32 {D0-D1}, [R0]
VLD1.32 {D2-D3}, [R1]!
VADD.I32 Q0, Q0, Q1
VST1.32 {D0-D1}, [R0]!
SUBS R3, #1
BNE .loop1
BX LR
Here are all the benchmark results:
add, char, C++ : 83 ms
add, char, neon asm : 46 ms FASTER x2
add, short, C++ : 114 ms
add, short, neon asm : 92 ms FASTER x1.25
add, unsigned, C++ : 176 ms
add, unsigned, neon asm : 184 ms SLOWER!!!
add, float, C++ : 571 ms
add, float, neon asm : 184 ms FASTER x3
add, double, C++ : 533 ms
add, double, neon asm : 420 ms FASTER x1.25
THE QUESTION: Why is NEON slower with 32-bit integer types?
Recommended answer
The NEON pipeline on Cortex-A8 executes in order and has limited hit-under-miss support (no register renaming), so you are limited by memory latency, because your working set is far larger than the L1/L2 caches. Each vadd in your loop depends directly on the values that were just loaded from memory, so the pipeline stalls constantly while waiting for memory. This would explain why the NEON code is slightly (by a tiny amount) slower than the non-NEON code.
You need to unroll the assembly loop and increase the distance between each load and its first use, e.g.:
vld1.32 {q0}, [%[x]]!
vld1.32 {q1}, [%[y]]!
vld1.32 {q2}, [%[x]]!
vld1.32 {q3}, [%[y]]!
vadd.i32 q0 ,q0, q1
vadd.i32 q2 ,q2, q3
...
There are plenty of NEON registers, so you can unroll the loop a lot. The integer code suffers from the same issue, but to a lesser extent, because the A8 integer pipeline has better hit-under-miss behaviour and keeps going instead of stalling. For benchmarks this large compared to the L1/L2 caches, the bottleneck is going to be memory bandwidth and latency. You might also want to run the benchmark at smaller sizes (4 KB to 256 KB) to see the effect when the data fits entirely in L1 and/or L2.
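A size sweep like the one suggested could be sketched as follows. This is a rough, portable harness under assumptions: ns_per_element is a hypothetical name, it times only the plain C++ loop, and it uses std::chrono::steady_clock rather than whatever timer the original benchmark used. For reference, the arrays in the question hold 8 * 1024 * 1024 unsigned values, which is 32 MB each if unsigned is 4 bytes.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): run `reps` passes of
// x[i] += y[i] over two arrays of `bytes` bytes each and return the average
// time per element in nanoseconds.
double ns_per_element(std::size_t bytes, int reps)
{
    const std::size_t n = bytes / sizeof(unsigned);
    if (n == 0 || reps <= 0) {
        return 0.0;
    }
    std::vector<unsigned> x(n, 1u), y(n, 2u);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += y[i];
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read one result through a volatile so the loop is not optimized away.
    volatile unsigned sink = x[n / 2];
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (static_cast<double>(n) * reps);
}
```

Calling it with 4 * 1024, 256 * 1024 and 32 * 1024 * 1024 bytes, and scaling reps so each run touches a similar amount of data, should show whether the time per element rises once the arrays no longer fit in cache. The same wrapper could time the NEON variants for comparison.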