Why are NEON intrinsics for multiplication and addition slower than operators?
I have written a test app to compare a C++ implementation and a NEON-optimized implementation of the element-wise multiplication of two vectors containing complex numbers.
The NEON implementation is ~3x faster than the C++ one (code 1).
But if I replace the NEON multiplication intrinsic vmulq_f32 with the multiplication operator * to multiply two NEON registers, I get a ~4x speedup.
And if I also replace the NEON add/subtract intrinsics vaddq_f32/vsubq_f32 with +/- to add/subtract two NEON registers, I get a ~5x speedup (code 2).
I don't understand what's going on. Why are NEON intrinsics slower than regular operators?
Code 1 (~3x faster than cpp):
// (a + ib) * (c + id) = (ac - bd) + i(ad + bc)
void complex_mult_neon(
    std::vector<float>& inVec1,
    std::vector<float>& inVec2,
    std::vector<float>& outVec)
{
    float* src1 = &inVec1[0];
    float* src2 = &inVec2[0];
    float* dst = &outVec[0];
    float32x4x2_t reg_s1, reg_s2;
    float32x4_t reg_p1, reg_p2;
    float32x4x2_t reg_r;
    for (auto count = inVec1.size(); count > 0; count -= 8)
    {
        reg_s1 = vld2q_f32(src1);
        src1 += 8;
        reg_s2 = vld2q_f32(src2);
        src2 += 8;
        // ac
        reg_p1 = vmulq_f32(reg_s1.val[0], reg_s2.val[0]);
        // bd
        reg_p2 = vmulq_f32(reg_s1.val[1], reg_s2.val[1]);
        // ac - bd
        reg_r.val[0] = vsubq_f32(reg_p1, reg_p2);
        // ad
        reg_p1 = vmulq_f32(reg_s1.val[0], reg_s2.val[1]);
        // bc
        reg_p2 = vmulq_f32(reg_s1.val[1], reg_s2.val[0]);
        // ad + bc
        reg_r.val[1] = vaddq_f32(reg_p1, reg_p2);
        vst2q_f32(dst, reg_r);
        dst += 8;
    }
}
Code 2 (~5x faster than cpp):
void complex_mult_neon(...)
{
    // same as above ...
    for (auto count = inVec1.size(); count > 0; count -= 8)
    {
        reg_s1 = vld2q_f32(src1);
        src1 += 8;
        reg_s2 = vld2q_f32(src2);
        src2 += 8;
        // ac
        reg_p1 = reg_s1.val[0] * reg_s2.val[0];
        // bd
        reg_p2 = reg_s1.val[1] * reg_s2.val[1];
        // ac - bd
        reg_r.val[0] = reg_p1 - reg_p2;
        // ad
        reg_p1 = reg_s1.val[0] * reg_s2.val[1];
        // bc
        reg_p2 = reg_s1.val[1] * reg_s2.val[0];
        // ad + bc
        reg_r.val[1] = reg_p1 + reg_p2;
        vst2q_f32(dst, reg_r);
        dst += 8;
    }
}
C++ code:
void complex_mult_cpp(
    std::vector<float>& inVec1,
    std::vector<float>& inVec2,
    std::vector<float>& outVec)
{
    float p1, p2;
    for (auto i = 0; i < inVec1.size(); i += 2)
    {
        // ac
        p1 = inVec1[i] * inVec2[i];
        // bd
        p2 = inVec1[i + 1] * inVec2[i + 1];
        // ac - bd
        outVec[i] = p1 - p2;
        // ad
        p1 = inVec1[i] * inVec2[i + 1];
        // bc
        p2 = inVec1[i + 1] * inVec2[i];
        // ad + bc
        outVec[i + 1] = p1 + p2;
    }
}
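Before comparing timings, it can help to confirm that every variant computes the same values. The following is a minimal sketch of a reference check built on std::complex; the helper names complex_mult_reference and nearly_equal are hypothetical and do not appear in the original post, and the interleaved layout [re0, im0, re1, im1, ...] is taken from the code above.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): multiplies two buffers of
// interleaved complex floats [re0, im0, re1, im1, ...] via std::complex.
std::vector<float> complex_mult_reference(const std::vector<float>& a,
                                          const std::vector<float>& b)
{
    // Both inputs must hold the same even number of floats.
    assert(a.size() == b.size() && a.size() % 2 == 0);
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); i += 2) {
        const std::complex<float> x(a[i], a[i + 1]);
        const std::complex<float> y(b[i], b[i + 1]);
        const std::complex<float> z = x * y;  // (ac - bd) + i(ad + bc)
        out[i] = z.real();
        out[i + 1] = z.imag();
    }
    return out;
}

// Hypothetical helper: true if both buffers have the same length and every
// element pair differs by at most tol (absolute tolerance).
bool nearly_equal(const std::vector<float>& x,
                  const std::vector<float>& y,
                  float tol = 1e-5f)
{
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::fabs(x[i] - y[i]) > tol) return false;
    }
    return true;
}
```

Running each implementation on the same input and comparing its output against this reference with nearly_equal rules out the possibility that one variant is fast only because it computes something different.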
Tools used: clang, NDK 16, Samsung S6 (AT&T)
EDIT: Adding disassembly as suggested
So I looked at the disassembly for code 1 and code 2.
Disassembly for code 1 (only the relevant portion between ld2 and st2):
88: 00 89 40 4c ld2 { v0.4s, v1.4s }, [x8]
8c: 22 1c a1 4e mov v2.16b, v1.16b
90: 03 1c a0 4e mov v3.16b, v0.16b
94: e8 07 40 f9 ldr x8, [sp, #8]
98: 03 55 80 3d str q3, [x8, #336]
9c: 02 59 80 3d str q2, [x8, #352]
a0: 02 55 c0 3d ldr q2, [x8, #336]
a4: 02 5d 80 3d str q2, [x8, #368]
a8: 02 59 c0 3d ldr q2, [x8, #352]
ac: 02 61 80 3d str q2, [x8, #384]
; outVec[i] = p1 - p2;
b0: 02 5d c0 3d ldr q2, [x8, #368]
b4: 02 75 80 3d str q2, [x8, #464]
b8: 02 61 c0 3d ldr q2, [x8, #384]
bc: 02 79 80 3d str q2, [x8, #480]
c0: e9 2b 40 f9 ldr x9, [sp, #80]
c4: 29 81 00 91 add x9, x9, #32
c8: e9 2b 00 f9 str x9, [sp, #80]
cc: e9 27 40 f9 ldr x9, [sp, #72]
d0: 20 89 40 4c ld2 { v0.4s, v1.4s }, [x9]
; p1 = inVec1[i] * inVec2[i + 1];
d4: 22 1c a1 4e mov v2.16b, v1.16b
d8: 03 1c a0 4e mov v3.16b, v0.16b
dc: 03 45 80 3d str q3, [x8, #272]
e0: 02 49 80 3d str q2, [x8, #288]
e4: 02 45 c0 3d ldr q2, [x8, #272]
e8: 02 4d 80 3d str q2, [x8, #304]
ec: 02 49 c0 3d ldr q2, [x8, #288]
f0: 02 51 80 3d str q2, [x8, #320]
f4: 02 4d c0 3d ldr q2, [x8, #304]
f8: 02 6d 80 3d str q2, [x8, #432]
fc: 02 51 c0 3d ldr q2, [x8, #320]
100: 02 71 80 3d str q2, [x8, #448]
104: e9 27 40 f9 ldr x9, [sp, #72]
108: 29 81 00 91 add x9, x9, #32
10c: e9 27 00 f9 str x9, [sp, #72]
; p2 = inVec1[i + 1] * inVec2[i];
110: 02 75 c0 3d ldr q2, [x8, #464]
114: 03 6d c0 3d ldr q3, [x8, #432]
118: e2 27 80 3d str q2, [sp, #144]
11c: e3 23 80 3d str q3, [sp, #128]
120: e2 27 c0 3d ldr q2, [sp, #144]
124: e3 23 c0 3d ldr q3, [sp, #128]
128: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
12c: e2 1f 80 3d str q2, [sp, #112]
130: e2 1f c0 3d ldr q2, [sp, #112]
134: e2 0f 80 3d str q2, [sp, #48]
138: 02 79 c0 3d ldr q2, [x8, #480]
13c: 03 71 c0 3d ldr q3, [x8, #448]
140: 02 39 80 3d str q2, [x8, #224]
144: 03 35 80 3d str q3, [x8, #208]
148: 02 39 c0 3d ldr q2, [x8, #224]
; outVec[i + 1] = p1 + p2;
14c: 03 35 c0 3d ldr q3, [x8, #208]
150: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
154: 02 31 80 3d str q2, [x8, #192]
158: 02 31 c0 3d ldr q2, [x8, #192]
15c: e2 0b 80 3d str q2, [sp, #32]
160: e2 0f c0 3d ldr q2, [sp, #48]
164: e3 0b c0 3d ldr q3, [sp, #32]
168: 02 2d 80 3d str q2, [x8, #176]
16c: 03 29 80 3d str q3, [x8, #160]
170: 02 2d c0 3d ldr q2, [x8, #176]
174: 03 29 c0 3d ldr q3, [x8, #160]
178: 42 d4 a3 4e fsub v2.4s, v2.4s, v3.4s
; for (auto i = 0; i < inVec1.size(); i += 2)
17c: 02 25 80 3d str q2, [x8, #144]
180: 02 25 c0 3d ldr q2, [x8, #144]
184: 02 65 80 3d str q2, [x8, #400]
188: 02 75 c0 3d ldr q2, [x8, #464]
;
18c: 03 71 c0 3d ldr q3, [x8, #448]
190: 02 21 80 3d str q2, [x8, #128]
194: 03 1d 80 3d str q3, [x8, #112]
198: 02 21 c0 3d ldr q2, [x8, #128]
19c: 03 1d c0 3d ldr q3, [x8, #112]
1a0: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
1a4: 02 19 80 3d str q2, [x8, #96]
1a8: 02 19 c0 3d ldr q2, [x8, #96]
1ac: e2 0f 80 3d str q2, [sp, #48]
1b0: 02 79 c0 3d ldr q2, [x8, #480]
1b4: 03 6d c0 3d ldr q3, [x8, #432]
1b8: 02 15 80 3d str q2, [x8, #80]
1bc: 03 11 80 3d str q3, [x8, #64]
1c0: 02 15 c0 3d ldr q2, [x8, #80]
1c4: 03 11 c0 3d ldr q3, [x8, #64]
1c8: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
1cc: 02 0d 80 3d str q2, [x8, #48]
1d0: 02 0d c0 3d ldr q2, [x8, #48]
1d4: e2 0b 80 3d str q2, [sp, #32]
1d8: e2 0f c0 3d ldr q2, [sp, #48]
1dc: e3 0b c0 3d ldr q3, [sp, #32]
1e0: 02 09 80 3d str q2, [x8, #32]
1e4: 03 05 80 3d str q3, [x8, #16]
1e8: 02 09 c0 3d ldr q2, [x8, #32]
1ec: 03 05 c0 3d ldr q3, [x8, #16]
1f0: 42 d4 23 4e fadd v2.4s, v2.4s, v3.4s
1f4: 02 01 80 3d str q2, [x8]
1f8: 02 01 c0 3d ldr q2, [x8]
1fc: 02 69 80 3d str q2, [x8, #416]
200: 02 65 c0 3d ldr q2, [x8, #400]
204: 02 3d 80 3d str q2, [x8, #240]
208: 02 69 c0 3d ldr q2, [x8, #416]
20c: 02 41 80 3d str q2, [x8, #256]
210: e9 23 40 f9 ldr x9, [sp, #64]
214: 02 3d c0 3d ldr q2, [x8, #240]
218: 03 41 c0 3d ldr q3, [x8, #256]
21c: 40 1c a2 4e mov v0.16b, v2.16b
220: 61 1c a3 4e mov v1.16b, v3.16b
224: 20 89 00 4c st2 { v0.4s, v1.4s }, [x9]
Disassembly for code 2:
88: 00 89 40 4c ld2 { v0.4s, v1.4s }, [x8]
8c: 22 1c a1 4e mov v2.16b, v1.16b
90: 03 1c a0 4e mov v3.16b, v0.16b
94: e8 07 40 f9 ldr x8, [sp, #8]
98: 03 11 80 3d str q3, [x8, #64]
9c: 02 15 80 3d str q2, [x8, #80]
a0: 02 11 c0 3d ldr q2, [x8, #64]
a4: 02 19 80 3d str q2, [x8, #96]
a8: 02 15 c0 3d ldr q2, [x8, #80]
ac: 02 1d 80 3d str q2, [x8, #112]
; outVec[i] = p1 - p2;
b0: 02 19 c0 3d ldr q2, [x8, #96]
b4: 02 31 80 3d str q2, [x8, #192]
b8: 02 1d c0 3d ldr q2, [x8, #112]
bc: 02 35 80 3d str q2, [x8, #208]
c0: e9 2b 40 f9 ldr x9, [sp, #80]
c4: 29 81 00 91 add x9, x9, #32
c8: e9 2b 00 f9 str x9, [sp, #80]
cc: e9 27 40 f9 ldr x9, [sp, #72]
d0: 20 89 40 4c ld2 { v0.4s, v1.4s }, [x9]
; p1 = inVec1[i] * inVec2[i + 1];
d4: 22 1c a1 4e mov v2.16b, v1.16b
d8: 03 1c a0 4e mov v3.16b, v0.16b
dc: e3 27 80 3d str q3, [sp, #144]
e0: 02 05 80 3d str q2, [x8, #16]
e4: e2 27 c0 3d ldr q2, [sp, #144]
e8: 02 09 80 3d str q2, [x8, #32]
ec: 02 05 c0 3d ldr q2, [x8, #16]
f0: 02 0d 80 3d str q2, [x8, #48]
f4: 02 09 c0 3d ldr q2, [x8, #32]
f8: 02 29 80 3d str q2, [x8, #160]
fc: 02 0d c0 3d ldr q2, [x8, #48]
100: 02 2d 80 3d str q2, [x8, #176]
104: e9 27 40 f9 ldr x9, [sp, #72]
108: 29 81 00 91 add x9, x9, #32
10c: e9 27 00 f9 str x9, [sp, #72]
; p2 = inVec1[i + 1] * inVec2[i];
110: 02 31 c0 3d ldr q2, [x8, #192]
114: 03 29 c0 3d ldr q3, [x8, #160]
118: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
11c: e2 0f 80 3d str q2, [sp, #48]
120: 02 35 c0 3d ldr q2, [x8, #208]
124: 03 2d c0 3d ldr q3, [x8, #176]
128: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
12c: e2 0b 80 3d str q2, [sp, #32]
130: e2 0f c0 3d ldr q2, [sp, #48]
134: e3 0b c0 3d ldr q3, [sp, #32]
138: 42 d4 a3 4e fsub v2.4s, v2.4s, v3.4s
13c: 02 21 80 3d str q2, [x8, #128]
140: 02 31 c0 3d ldr q2, [x8, #192]
144: 03 2d c0 3d ldr q3, [x8, #176]
148: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
; outVec[i + 1] = p1 + p2;
14c: e2 0f 80 3d str q2, [sp, #48]
150: 02 35 c0 3d ldr q2, [x8, #208]
154: 03 29 c0 3d ldr q3, [x8, #160]
158: 42 dc 23 6e fmul v2.4s, v2.4s, v3.4s
15c: e2 0b 80 3d str q2, [sp, #32]
160: e2 0f c0 3d ldr q2, [sp, #48]
164: e3 0b c0 3d ldr q3, [sp, #32]
168: 42 d4 23 4e fadd v2.4s, v2.4s, v3.4s
16c: 02 25 80 3d str q2, [x8, #144]
170: 02 21 c0 3d ldr q2, [x8, #128]
174: e2 1f 80 3d str q2, [sp, #112]
178: 02 25 c0 3d ldr q2, [x8, #144]
; for (auto i = 0; i < inVec1.size(); i += 2)
17c: e2 23 80 3d str q2, [sp, #128]
180: e9 23 40 f9 ldr x9, [sp, #64]
184: e2 1f c0 3d ldr q2, [sp, #112]
188: e3 23 c0 3d ldr q3, [sp, #128]
;
18c: 40 1c a2 4e mov v0.16b, v2.16b
190: 61 1c a3 4e mov v1.16b, v3.16b
194: 20 89 00 4c st2 { v0.4s, v1.4s }, [x9]
The disassembly does explain the reason for the speedup. Notice how, in the first listing, there are so many (seemingly unnecessary) ldr and str instructions between fmul and fmul/fadd.
Now the question is: why does the same compiler produce such poor assembly for code 1? What is the reason for all these unnecessary ldr and str instructions?
I checked the disassembly, since you seem to have the same development environment as I do:
LD2 {V0.4S-V1.4S}, [src1],#0x20
LD2 {V2.4S-V3.4S}, [src2],#0x20
SUB W8, W8, #8
CMP W8, #8
FMUL V4.4S, V3.4S, V1.4S
FNEG V4.4S, V4.4S
FMLA V4.4S, V0.4S, V2.4S
FMUL V5.4S, V2.4S, V1.4S
FMLA V5.4S, V0.4S, V3.4S
ST2 {V4.4S-V5.4S}, [dst],#0x20
B.GT loc_4C
Both versions generate the same bad machine code.
Why don't you post your disassembly? Mine might be slightly different, since I had to convert the parameters to simple types (float *).
If your disassembly looks the same, it must be a benchmarking failure. There is no other explanation.
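To make a benchmarking problem easier to spot, here is a minimal timing sketch. The function name average_microseconds is hypothetical and not from the original thread. One assumption worth checking first: a listing in which every value is stored to the stack and reloaded between arithmetic instructions is what clang typically emits when optimization is disabled, so the measured build should be compiled with optimization enabled (for example -O2) before the timings are compared.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the original thread): runs `kernel` once as a
// warm-up, then `iterations` more times, and returns the mean wall-clock time
// per timed call in microseconds.
template <typename Kernel>
double average_microseconds(Kernel&& kernel, std::size_t iterations)
{
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals

    kernel();  // warm-up call, excluded from the measurement

    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        kernel();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Each implementation can be wrapped in a lambda that captures its input and output buffers, so all variants are timed on identical data and with the same number of iterations.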
Update:
In this case, rule out everything unnecessary:
Change all arguments to simple float * like I did.
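A rough sketch of what a plain float * version might look like follows. The name complex_mult_ptr and its signature are assumptions, not the answerer's actual code. The NEON path is compiled only when the compiler defines __ARM_NEON, a scalar loop handles the remaining elements (and serves as the fallback on other targets), and the element count is passed explicitly.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Sketch only: name and signature are assumptions, not from the original answer.
// src1, src2 and dst hold interleaved complex floats [re0, im0, re1, im1, ...];
// n is the number of floats (not complex values) and must be even.
void complex_mult_ptr(const float* src1, const float* src2, float* dst, std::size_t n)
{
    assert(n % 2 == 0);
    std::size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Vector path: four complex products (eight floats) per iteration.
    for (; i + 8 <= n; i += 8) {
        const float32x4x2_t s1 = vld2q_f32(src1 + i);  // val[0] = a, val[1] = b
        const float32x4x2_t s2 = vld2q_f32(src2 + i);  // val[0] = c, val[1] = d
        float32x4x2_t r;
        // real part: ac - bd
        r.val[0] = vsubq_f32(vmulq_f32(s1.val[0], s2.val[0]),
                             vmulq_f32(s1.val[1], s2.val[1]));
        // imaginary part: ad + bc
        r.val[1] = vaddq_f32(vmulq_f32(s1.val[0], s2.val[1]),
                             vmulq_f32(s1.val[1], s2.val[0]));
        vst2q_f32(dst + i, r);
    }
#endif

    // Scalar path: leftover elements on NEON targets, everything elsewhere.
    for (; i < n; i += 2) {
        const float a = src1[i], b = src1[i + 1];
        const float c = src2[i], d = src2[i + 1];
        dst[i] = a * c - b * d;
        dst[i + 1] = a * d + b * c;
    }
}
```

Unlike the loop in the question, which subtracts 8 from an unsigned count and therefore relies on the size being a multiple of 8, this sketch stops the vector loop early and lets the scalar loop finish the remainder.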