Performance of sqrt function on AArch64

Problem description

I'm measuring the performance of the sqrt function on AArch64 for academic reasons. Code for the single-precision sqrtf function:

fsqrt s0, s0 
ret

Code for the double-precision sqrt function:

fsqrt d0, d0 
ret

I'm referring to theoretical latencies for FSQRT from here: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf

Going by those latencies, single-precision sqrt should be about 2x faster than double-precision.

But while profiling, I get these numbers:

326 ms  sqrt
 82 ms  sqrtf

Both timings are for the same number of loop iterations. From those numbers, sqrtf seems to be 4x faster.

I can't work out the reason for this, and I haven't found a proper explanation online of how this instruction actually behaves.

Some info or direction on this would be really useful.

Solution

If you look at the note attached to the table entries for the FSQRT instruction in the Cortex-A57 optimization guide, it says that the "FP divide and square root operations are performed using an iterative algorithm".

That means that depending on the input to the instruction, the latency will vary. That is the meaning of the "7-17" and "7-32" latency numbers in the table. Depending on the input the single-precision FSQRT can take between 7 and 17 cycles to complete whereas the double-precision variant can take between 7 and 32 cycles.
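One way to check this on your own hardware is to time the call for one fixed operand at a time. The following is only a minimal sketch in C, not the asker's actual harness: the names unary_f32, passthrough_f32 and time_calls_f32 are hypothetical, and it assumes a POSIX system that provides clock_gettime with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: any float -> float routine, e.g. sqrtf. */
typedef float (*unary_f32)(float);

/* Baseline that does no arithmetic; timing it estimates the loop and
   call overhead that every measurement includes. */
static float passthrough_f32(float x) {
    return x;
}

/* Wall-clock seconds spent calling fn(operand) `iters` times.
   The volatile input and sink keep the compiler from hoisting or
   dropping the calls. */
static double time_calls_f32(unary_f32 fn, float operand, long iters) {
    assert(fn != NULL);
    assert(iters > 0);
    volatile float in = operand;
    volatile float sink = 0.0f;
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++) {
        sink = fn(in);
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);
    (void)sink;

    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) * 1e-9;
}
```

Assuming <math.h> is included and the program is linked against the math library, time_calls_f32(sqrtf, 2.0f, n) minus time_calls_f32(passthrough_f32, 2.0f, n) approximates the time spent in sqrtf for the operand 2.0f. Repeating that for several operands, and with a double-precision twin of the same function, should show whether the timing really depends on the input.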

So if a particular single-precision computation happens to take 7 cycles but a double-precision computation takes, say, 28 cycles, you have a 4x disparity.
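The arithmetic behind that can be written out as a small C sketch. The struct and function names here are made up for illustration, the cycle counts are the "7-17" and "7-32" ranges quoted above, and 326 ms and 82 ms are the timings from the question.

```c
#include <assert.h>
#include <stdio.h>

/* Latency range in cycles, copied from the figures quoted above. */
struct latency_range {
    int min_cycles;
    int max_cycles;
};

static const struct latency_range FSQRT_SINGLE = {7, 17};
static const struct latency_range FSQRT_DOUBLE = {7, 32};

/* Largest slowdown of `slow` relative to `fast` that the ranges permit:
   `slow` at its maximum latency, `fast` at its minimum. */
static double max_ratio(struct latency_range slow, struct latency_range fast) {
    assert(fast.min_cycles > 0);
    return (double)slow.max_cycles / (double)fast.min_cycles;
}

/* Ratio when both precisions sit at the upper end of their range. */
static double upper_bound_ratio(struct latency_range slow,
                                struct latency_range fast) {
    assert(fast.max_cycles > 0);
    return (double)slow.max_cycles / (double)fast.max_cycles;
}

/* Print the two ratios next to the measured 326 ms / 82 ms. */
static void print_ratios(void) {
    double measured = 326.0 / 82.0;
    printf("upper bounds only: %.2fx\n",
           upper_bound_ratio(FSQRT_DOUBLE, FSQRT_SINGLE));
    printf("largest possible:  %.2fx\n",
           max_ratio(FSQRT_DOUBLE, FSQRT_SINGLE));
    printf("measured:          %.2fx\n", measured);
}
```

Calling print_ratios() from main gives roughly 1.88x when only the upper bounds are compared (the "2x" reading of the table), 4.57x for the largest gap the ranges allow, and 3.98x for the measured times. The measured ratio also includes call and loop overhead, so this is only a plausibility check, but it does fall inside what the latency ranges permit.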
