What's the relative speed of floating point add vs. floating point multiply?

Question

A decade or two ago, it was worthwhile to write numerical code that avoided multiplies and divides and used addition and subtraction instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.
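To make the forward-difference idea concrete, here is a minimal sketch for a quadratic p(x) = a*x^2 + b*x + c sampled at evenly spaced points. The function names and parameters (`eval_direct`, `eval_forward_diff`, `x0`, `h`, `n`) are illustrative assumptions, not taken from the question.

```cpp
#include <cstddef>
#include <vector>

// Direct evaluation (Horner's rule): multiplies and adds at every point.
// All names here are hypothetical, chosen for illustration only.
std::vector<double> eval_direct(double a, double b, double c,
                                double x0, double h, std::size_t n) {
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        double x = x0 + h * static_cast<double>(i);
        out[i] = (a * x + b) * x + c;
    }
    return out;
}

// Forward differences: after a one-time setup, each point costs
// two additions and no multiplications.
std::vector<double> eval_forward_diff(double a, double b, double c,
                                      double x0, double h, std::size_t n) {
    std::vector<double> out(n);
    double y  = (a * x0 + b) * x0 + c;               // p(x0)
    double d1 = a * (2.0 * x0 * h + h * h) + b * h;  // p(x0 + h) - p(x0)
    double d2 = 2.0 * a * h * h;                     // constant second difference
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = y;
        y  += d1;   // step to the next sample
        d1 += d2;   // update the first difference
    }
    return out;
}
```

The forward-difference version accumulates rounding error from point to point, so over long runs its results can drift from the direct evaluation.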

Is this still the case, or have modern computer architectures advanced to the point where * and / are no longer many times slower than + and -?

To be specific, I'm interested in compiled C/C++ code running on typical modern x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.

Recommended answer

It also depends on the instruction mix. Your processor has several computation units standing by at any time, and you get maximum throughput if all of them are kept busy all the time. So executing a loop of muls is just as fast as executing a loop of adds, but the same doesn't hold once the expression becomes more complex.

Take this loop, for example:

for(int j=0;j<NUMITER;j++) {
  for(int i=1;i<NUMEL;i++) {
    bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i] ;
  }
}

For NUMITER=10^7 and NUMEL=10^2, with all arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

it only takes 1.7 seconds. So since we "overdid" the additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

This has the same mul/add distribution, but now the constant is added in rather than multiplied in, and it takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently, so dot-product-like sums of muls and scaled sums are about as good as it gets. Adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as the initial loop, but without the expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but one addition: 1.9 seconds)

So, basically, it's hard to say which is faster, but if you wish to avoid bottlenecks, it's more important to have a sane mix, avoid NaN or INF, and avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since small changes can often make all the difference.
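For anyone who wants to repeat this kind of measurement, a minimal timing sketch might look like the following. The helper names `seconds_for` and `mul_add_mix` are hypothetical, and this is not the harness that produced the timings in this answer.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `body` once and returns elapsed wall-clock seconds.
template <typename F>
double seconds_for(F&& body) {
    auto start = std::chrono::steady_clock::now();
    body();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// One of the loop bodies from above, wrapped in a function. The result is
// returned so the caller can use it; an unused result may let the compiler
// discard the whole loop.
double mul_add_mix(const std::vector<double>& arr1,
                   const std::vector<double>& arr2,
                   const std::vector<double>& arr3,
                   const std::vector<double>& arr4,
                   double someval, int numiter) {
    double bla = 0.0;
    const std::size_t numel = arr1.size();
    for (int j = 0; j < numiter; ++j) {
        for (std::size_t i = 1; i < numel; ++i) {  // starts at 1, as in the loops above
            bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i];
        }
    }
    return bla;
}
```

A typical use would be `double t = seconds_for([&] { sink = mul_add_mix(a1, a2, a3, a4, 2.1, NUMITER); });` followed by printing both `t` and `sink`. Timings from a sketch like this depend heavily on the compiler, the optimization flags and the machine, so compare variants only within the same build and on the same hardware.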

More cases:

bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i] ;// arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, floats 2.2 seconds
bla += someval * arr1[i]* arr2[i];// 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];// 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];// 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86
