What's the relative speed of floating point add vs. floating point multiply


Question



A decade or two ago, it was worthwhile to write numerical code to avoid using multiplies and divides and use addition and subtraction instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.

Is this still the case, or have modern computer architectures advanced to the point where *,/ are no longer many times slower than +,- ?

To be specific, I'm interested in compiled C/C++ code running on modern typical x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.

Solution

It also depends on the instruction mix. Your processor has several computation units standing by at any time, and you'll get maximum throughput if all of them are filled all the time. So executing a loop of muls is just as fast as executing a loop of adds -- but the same doesn't hold if the expression becomes more complex.

For example, take this loop:

for (int j = 0; j < NUMITER; j++) {
  for (int i = 1; i < NUMEL; i++) {
    bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i];
  }
}

For NUMITER=10^7 and NUMEL=10^2, with the arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit processor. If I replace the loop body with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

it only takes 1.7 seconds... So, since we had an excess of additions, the muls were essentially free, and the reduction in additions helped. It gets more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently, so dot-product-like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as initial loop, but without expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but with one addition: 1.9 seconds)

So, basically: it's hard to say which is faster, but if you wish to avoid bottlenecks it's more important to have a sane mix, to avoid NaN or INF, and to avoid adding constants. Whatever you do, make sure you test, and test with various compiler settings, since often small changes can make the difference.

Some more cases:

bla *= someval;           // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i];           // arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i]; // 32-bit mode, floats, 2.2 seconds
bla += someval * arr1[i] * arr2[i];  // 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];            // 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];            // 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86
