How many CPU cycles are needed for each assembly instruction?

Question

I heard there is an Intel book online which describes the CPU cycles needed for specific assembly instructions, but I can't find it (after trying hard). Could anyone show me how to find the CPU cycle counts?

Here is an example: in the code below, mov and lock are 1 CPU cycle each, and xchg is 3 CPU cycles.

// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}

#endif // WIN32

BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx
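
A minimal sketch of the same test-and-set using a compiler intrinsic instead of inline asm, assuming MSVC (note that xchg with a memory operand is implicitly locked, so the explicit lock prefix above is redundant):

// Sketch only: _InterlockedExchange atomically stores nValue at the target
// address and returns the value that was there before, just like the xchg.
#include <intrin.h>

inline int TestAndSet(int* pTargetAddress, int nValue)
{
    return _InterlockedExchange(
        reinterpret_cast<volatile long*>(pTargetAddress),
        static_cast<long>(nValue));
}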

Answer

Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!

While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still (often) get a highly accurate analysis of the behavior of some piece of code (especially a loop), as described below and in other linked resources.

First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no less than thirty different microarchitectures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN's and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:

add eax, eax
add eax, eax
add eax, eax
add eax, eax  # total latency of 4 cycles for these 4 adds

Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions are independent, it is possible that on modern chips all 4 add instructions can execute in the same cycle:

add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute in parallel in a single cycle

Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).

The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.

This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
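
As an illustrative sketch in the same style as the add examples above (the cycle counts are the latency and throughput figures just described):

imul eax, eax   # dependent chain: each imul must wait for the
imul eax, eax   # previous result, so ~3 cycles per imul
imul eax, eax

imul eax, eax   # independent imuls: the pipelined multiply unit can
imul ebx, ebx   # start one per cycle, so ~1 cycle per imul in
imul ecx, ecx   # steady state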

So with this information, you can start to see how to analyze instruction timings on modern CPUs.
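
For example, here is a minimal sketch (my own illustration, not taken from the linked resources) of measuring the 1-cycle add latency directly, assuming GCC or Clang on x86-64 and a CPU whose TSC ticks close to the core clock:

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc()

int main() {
    constexpr int kIters = 100000000;
    uint64_t x = 1;

    uint64_t start = __rdtsc();
    for (int i = 0; i < kIters; ++i) {
        // "+r" ties each add's input to the previous add's output, forcing a
        // serial dependency chain; asm volatile keeps the compiler from
        // collapsing the loop.
        asm volatile("add %0, %0" : "+r"(x));
    }
    uint64_t cycles = __rdtsc() - start;

    // The loop overhead (increment, compare, branch) runs in parallel on
    // other execution ports, so the dependent adds dominate: expect ~1.0.
    printf("~%.2f cycles per dependent add (x=%llu)\n",
           (double)cycles / kIters, (unsigned long long)x);
    return 0;
}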

Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.

Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves, which you can mostly model well, but it takes work. For example, here's a recent post where the answer covers most of the relevant factors in some detail.
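
To make the memory-access point concrete, here is a hedged sketch (same assumptions as the snippet above) of a pointer chase: every load depends on the value just loaded, so the time per step approximates the load-to-use latency of whichever level of the memory hierarchy the buffer lands in:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>
#include <x86intrin.h>  // __rdtsc()

int main() {
    // Build one big random cycle so the next address is unpredictable
    // and hardware prefetchers can't help.
    constexpr size_t kNodes = size_t{1} << 22;  // 32 MiB of 8-byte links
    std::vector<size_t> order(kNodes);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    std::vector<size_t> next(kNodes);
    for (size_t i = 0; i < kNodes; ++i)
        next[order[i]] = order[(i + 1) % kNodes];

    constexpr int kSteps = 10000000;
    size_t p = order[0];
    uint64_t start = __rdtsc();
    for (int i = 0; i < kSteps; ++i)
        p = next[p];                 // each load depends on the previous one
    uint64_t cycles = __rdtsc() - start;

    printf("~%.1f cycles per dependent load (p=%zu)\n",
           (double)cycles / kSteps, p);
    return 0;
}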

Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Assembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops", which starts on page 95 in the current version of the PDF.

The basic idea is that you create a table with one row per instruction and mark the execution resources each one uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for loop-carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
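
As a rough sketch of such a table for a trivial array-summing loop (the port assignments are typical Skylake-era values, assumed here for illustration rather than taken from the guide):

.loop:
    add  eax, [rsi]   # 1 load uop  -> port 2 or 3
                      # 1 ALU uop   -> port 0, 1, 5, or 6
    add  rsi, 4       # 1 ALU uop   -> port 0, 1, 5, or 6
    dec  ecx          # 1 ALU uop   -> port 0, 1, 5, or 6
    jnz  .loop        # 1 branch uop -> port 6 (taken branches)

# Per iteration: 1 load uop over 2 load ports, 3 ALU uops over 4 ALU ports,
# and 1 taken branch on its single port. No port is asked for more than one
# uop per cycle, and each loop-carried chain (eax, rsi, ecx) is a single
# 1-cycle operation, so the predicted bottleneck is the taken branch at
# about 1 iteration per cycle.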

If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
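
For reference, the tool works by scanning a compiled binary for marker bytes that you insert around the region of interest via macros from its iacaMarks.h header; a minimal sketch (the exact command-line flags are from memory and may vary by version):

// Sketch: bracket the loop of interest with IACA's markers. The markers are
// magic byte sequences emitted inline, so the marked region must stay
// contiguous in the generated code.
#include "iacaMarks.h"

int sum(const int* a, int n) {
    int s = 0;
    IACA_START
    for (int i = 0; i < n; ++i)
        s += a[i];
    IACA_END
    return s;
}

// Compile to an object file and analyze it, e.g.:
//   gcc -c -O2 sum.c
//   iaca -arch SKL sum.o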

Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.

You can even get the timing results directly from Intel in their IA-32 and Intel 64 optimization manual, in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because it is more complete, often arrives before the Intel manual is updated, and is easier to use, as it comes in spreadsheet and PDF versions.

Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle-accurate analysis of code sequences.

If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.
