Estimating Cycles Per Instruction


Question

I have disassembled a small C++ program compiled with MSVC v140 and am trying to estimate the cycles per instruction in order to better understand how code design impacts performance. I've been following Mike Acton's CppCon 2014 talk on "Data-Oriented Design and C++", specifically the portion I've linked to.

In it, he points out these lines:

movss   8(%rbx), %xmm1
movss   12(%rbx), %xmm0

He then claims that these 2 x 32-bit reads are probably on the same cache line and together cost roughly 200 cycles.

The Intel 64 and IA-32 Architectures Optimization Reference Manual has been a great resource, specifically "Appendix C - Instruction Latency and Throughput". However, on page C-15 in "Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions" it states that movss is only 1 cycle (unless I'm misunderstanding what latency means here... if so, how do I read this table?)

I know that a theoretical prediction of execution time will never be correct, but nevertheless this is important to learn. How are these two commands 200 cycles, and how can I learn to reason about execution time beyond this snippet?

I've started to read some things on CPU pipelining... maybe the majority of the cycles are being picked up there?

PS: I'm not interested in actually measuring hardware performance counters here. I'm just looking to learn how to reasonably sight-read ASM and estimate cycles.

Answer

As you already pointed out, the theoretical throughput and latency of a MOVSS instruction is 1 cycle. You were looking at the right document (the Intel Optimization Manual). Agner Fog (mentioned in the comments) measured the same numbers in his Instruction Tables for Intel CPUs (AMD has higher latencies).

This leads us to the first problem: What specific microarchitecture are you investigating? This can make a big difference, even for the same vendor. Agner Fog reports that MOVSS has a 2-6cy latency on AMD Bulldozer depending on the source and destination (register vs memory). This is important to keep in mind when looking into performance of computer architectures.

The 200cy are most likely cache misses, as already pointed out by dwelch in the comments. The numbers you get from the Optimization Manual for any memory-accessing instruction all assume that the data resides in the first-level cache (L1). Now, if the data has never been touched by previous instructions, the cache line (64 bytes on Intel and AMD x86) will need to be loaded from memory into the last-level cache, from there into the second-level cache, then into L1 and finally into the XMM register (within 1 cycle). Transfers between L3-L2 and L2-L1 have a throughput (not latency!) of two cycles per cache line on current Intel microarchitectures. And the memory bandwidth can be used to estimate the throughput between L3 and memory (e.g., a 2 GHz CPU with an achievable memory bandwidth of 40 GB/s will have a throughput of 3.2 cycles per cache line). Cache lines or memory blocks are typically the smallest units that caches and memory can operate on; they differ between microarchitectures and may even differ within an architecture, depending on the cache level (L1, L2 and so on).
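The bandwidth-to-cycles conversion above can be written out as a small helper. This is just a sketch of the arithmetic from the text; the figures (2 GHz, 40 GB/s, 64-byte lines) are the hypothetical ones used in the example, not measurements:

```cpp
// Back-of-the-envelope conversion: turn an achievable memory
// bandwidth into a per-cache-line cycle cost at a given clock rate.
double cycles_per_cache_line(double cpu_hz,
                             double bandwidth_bytes_per_s,
                             double line_bytes) {
    // bytes transferred per core cycle at this bandwidth
    double bytes_per_cycle = bandwidth_bytes_per_s / cpu_hz;
    // cycles needed to move one full cache line
    return line_bytes / bytes_per_cycle;
}
```

For the figures in the text, `cycles_per_cache_line(2.0e9, 40.0e9, 64.0)` gives the 3.2 cycles per cache line quoted above.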

Now this is all throughput, not latency, which will not help you estimate what you have described above. To verify this, you would need to execute the instructions over and over (for at least 1/10 s) to get cycle-accurate measurements. By changing the instructions you can decide whether you measure latency (by introducing dependencies between instructions) or throughput (by making each instruction's input independent of the results of previous instructions). To account for cache and memory accesses you would need to predict whether an access hits a given cache level or not; this can be done using layer conditions.
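The dependency trick can be sketched with integer adds as a stand-in (a real benchmark would time the actual instruction under test with rdtsc/rdtscp or performance counters, and would keep the compiler from folding the loops, e.g. via volatile or inline asm; both function names here are hypothetical):

```cpp
#include <cstdint>

// Latency measurement: a serial dependency chain. Iteration i+1
// cannot start its add until iteration i's result is ready, so the
// run time is roughly n * latency(add).
uint64_t serial_chain(uint64_t x, int n) {
    for (int i = 0; i < n; ++i)
        x = x + 1;                      // depends on the previous result
    return x;
}

// Throughput measurement: four independent chains. The core can
// overlap them, so the run time approaches n / throughput(add) per
// chain rather than four serial chains back to back.
uint64_t parallel_chains(int n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < n; ++i) {
        a += 1; b += 1; c += 1; d += 1; // no cross-dependencies
    }
    return a + b + c + d;
}
```

Timing both loops for the same n and comparing cycles per add is the basic way latency and throughput numbers like Agner Fog's are obtained.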

A tool to estimate instruction execution (both latency and throughput) for Intel CPUs is the Intel Architecture Code Analyzer, which supports multiple microarchitectures up to Haswell. The latency predictions are to be taken with a grain of salt, since it is much harder to estimate latency than throughput.
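As a rough sketch of how IACA is used: you bracket the region of interest with marker macros from the `iacaMarks.h` header shipped with the tool, compile, and point the analyzer at the object file. The fallback macro definitions below are only there so this snippet builds standalone when the header is absent; the function and array contents are illustrative, not from the original question:

```cpp
// With the real header, IACA_START/IACA_END expand to marker bytes
// that the analyzer scans for; without it, compile them away.
#ifdef USE_IACA
#include <iacaMarks.h>
#else
#define IACA_START
#define IACA_END
#endif

// The analyzer reports throughput/latency estimates for the code
// between the two markers, e.g. (invocation may vary by version):
//   iaca -arch HSW myobject.o
float sum_two(const float* p) {
    IACA_START                  // begin analyzed region
    float s = p[0] + p[1];      // two adjacent 32-bit loads, as in the question
    IACA_END                    // end analyzed region
    return s;
}
```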
