Using NEON multiply accumulate on iOS
Problem description
Even though I am compiling for armv7 only, NEON multiply-accumulate intrinsics appear to be decomposed into separate multiplies and adds.

I've experienced this with several versions of Xcode up to the latest 4.5, with iOS SDKs 5 through 6, and with different optimisation settings, both building through Xcode and through the command line directly.
For instance, building and disassembling some test.cpp containing

    #include <arm_neon.h>

    float32x4_t test( float32x4_t a, float32x4_t b, float32x4_t c )
    {
        float32x4_t result = a;
        result = vmlaq_f32( result, b, c );
        return result;
    }

with

    clang++ -c -O3 -arch armv7 -o "test.o" test.cpp
    otool -arch armv7 -tv test.o

results in

    test.o:
    (__TEXT,__text) section
    __Z4test19__simd128_float32_tS_S_:
    00000000 f10d0910 add.w r9, sp, #16 @ 0x10
    00000004 46ec mov ip, sp
    00000006 ecdc2b04 vldmia ip, {d18-d19}
    0000000a ecd90b04 vldmia r9, {d16-d17}
    0000000e ff420df0 vmul.f32 q8, q9, q8
    00000012 ec432b33 vmov d19, r2, r3
    00000016 ec410b32 vmov d18, r0, r1
    0000001a ef400de2 vadd.f32 q8, q8, q9
    0000001e ec510b30 vmov r0, r1, d16
    00000022 ec532b31 vmov r2, r3, d17
    00000026 4770 bx lr

instead of the expected use of vmla.f32.

What am I doing wrong, please?
Solution
It is either a bug or an optimization by llvm-clang. armcc or gcc produces vmla as you expect, but the Cortex-A Series Programmer's Guide v3 says:
20.2.3 Scheduling
In some cases there can be a considerable latency, particularly VMLA multiply-accumulate (five cycles for an integer; seven cycles for a floating-point). Code using these instructions should be optimized to avoid trying to use the result value before it is ready, otherwise a stall will occur. Despite having a few cycles result latency, these instructions do fully pipeline so several operations can be in flight at once.
So it makes sense for llvm-clang to separate vmla into multiply and accumulate to fill the pipeline.
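
If you do want to keep several multiply-accumulates in flight yourself, as the scheduling note quoted above suggests, one common approach is to use independent accumulators so that consecutive operations never wait on each other's result. The following is only a rough sketch under assumptions: the function name dot_two_accumulators is hypothetical, the length must be a multiple of 8, and a plain scalar path is used when NEON is not available. Whether the compiler emits vmla.f32 or a vmul.f32/vadd.f32 pair for the intrinsic is still up to the compiler, as the question shows.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of x and y using two independent
// accumulators, so back-to-back multiply-accumulates do not depend on
// each other's result. Assumes n is a multiple of 8.
float dot_two_accumulators(const float* x, const float* y, std::size_t n)
{
    assert(n % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    for (std::size_t i = 0; i < n; i += 8) {
        // Two separate dependency chains: acc0 and acc1.
        acc0 = vmlaq_f32(acc0, vld1q_f32(x + i),     vld1q_f32(y + i));
        acc1 = vmlaq_f32(acc1, vld1q_f32(x + i + 4), vld1q_f32(y + i + 4));
    }
    // Combine the two accumulators, then sum the four lanes.
    float32x4_t acc  = vaddq_f32(acc0, acc1);
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    return vget_lane_f32(half, 0) + vget_lane_f32(half, 1);
#else
    // Scalar fallback with the same two-accumulator structure.
    float acc0[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    float acc1[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 8) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            acc0[lane] += x[i + lane] * y[i + lane];
            acc1[lane] += x[i + 4 + lane] * y[i + 4 + lane];
        }
    }
    float sum = 0.0f;
    for (std::size_t lane = 0; lane < 4; ++lane) {
        sum += acc0[lane] + acc1[lane];
    }
    return sum;
#endif
}
```

Because the accumulators are summed in a different order than in a single running total, the result can differ in the last bits from a naive loop for general inputs. Disassemble the output with otool as in the question to see what the compiler actually generated.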