(n-乘法)与(n/2-乘法+ 2加法)哪个更好? [英] (n - Multiplication) vs (n/2 - multiplication + 2 additions) which is better?

查看:135
本文介绍了(n-乘法)与(n/2-乘法+ 2加法)哪个更好?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个C语言程序,该程序具有n个乘法(具有n次迭代的单次乘法),并且发现了另一个具有n/2次迭代(1个乘法+ 2个加法)的逻辑.我知道两者都是O(n)的复杂性.但就CPU周期而言.哪个更快?

I have a C program that has n multiplications (single multiplication with n iterations) and I found another logic that has n/2 iterations of (1 multiplication + 2 additions). I know about the complexity that both are of O(n). but in terms of CPU cycles. which is faster ?

推荐答案

首先遵循Dietrich Epp的第一个建议-测量(至少对于复杂的优化问题而言)是确定的唯一方法.

First of all follow Dietrich Epp's first advice - measuring is (at least for complex optimization problems) the only way to be sure.

现在,如果您想弄清楚为什么一个比另一个要快,我们可以尝试.有两种不同的重要性能指标:延迟和互惠吞吐量.两者的简短摘要:

Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: Latency and reciprocal throughput. A short summary of the two:

延迟:这是指令在 依赖链.这些数字是最小值.缓存未命中, 对齐错误和异常可能会增加时钟计数 相当.启用超线程的地方,使用相同的 其他线程中的执行单元会导致性能下降. 异常数,NAN和无穷大不会增加延迟.这 使用的时间单位是核心时钟周期,而不是参考时钟周期 由时间戳记计数器给出.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN’s and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

倒数吞吐量:每个核心时钟周期的平均数量 一系列相同类型的独立指令的指令 在同一线程中.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

对于桑迪·布里奇来说add r, r/i的吞吐量(需要进一步注意r = register,i = immediate,m =内存)为0.33,而延迟为1.

For Sandy bridge the rec. throughput for an add r, r/i (for further notice r=register, i=immediate, m=memory) is 0.33 while the latency is 1.

imul r, r的等待时间为3,rec.吞吐量为1.

An imul r, r has a latency of 3 and a rec. throughput of 1.

因此,正如您所看到的,它完全取决于您的特定算法-如果您可以将一个imul替换为两个独立的加法,则算法的这一特定部分理论上可以获得50%的加速(并且在最佳情况下,显然可以达到50%的加速) 〜350%).但是另一方面,如果您的添加项添加了有问题的依赖关系,那么一个imul可能与一个添加项一样快.

So as you see it completely depends on your specific algorithm - if you can just replace one imul with two independent adds this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand if your adds add a problematic dependency one imul could be just as fast as one add.

还请注意,我们已经忽略了所有其他复杂性,例如内存和缓存行为(通常对执行时间有很大影响的事情)或诸如µop融合之类的复杂东西.通常,唯一需要关心这些内容的人是编译器编写者-只需衡量他们的努力结果,这就会简单得多;)

Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)

无论如何,如果您想很好地列出这些东西,请参见此处(以上延迟/记录吞吐量的描述也来自该特定文档.

Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).

这篇关于(n-乘法)与(n/2-乘法+ 2加法)哪个更好?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆