Estimating Cycles Per Instruction


Problem Description

I have disassembled a small C++ program compiled with MSVC v140 and am trying to estimate the cycles per instruction in order to better understand how code design impacts performance. I've been following Mike Acton's CppCon 2014 talk on "Data-Oriented Design and C++", specifically the portion I've linked to.

In it, he points out these lines:

movss   8(%rbx), %xmm1
movss   12(%rbx), %xmm0

He then claims that these 2 × 32-bit reads are probably on the same cache line and therefore cost roughly ~200 cycles.

The Intel 64 and IA-32 Architectures Optimization Reference Manual has been a great resource, specifically "Appendix C - Instruction Latency and Throughput". However, on page C-15 in "Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions", it states that movss is only 1 cycle (unless I'm misunderstanding what latency means here... if so, how do I read this table?)

I know that a theoretical prediction of execution time will never be exact, but it is nevertheless important to learn. How do these two instructions come to 200 cycles, and how can I learn to reason about execution time beyond this snippet?

I've started to read some things on CPU pipelining... maybe the majority of the cycles are being picked up there?

PS: I'm not interested in actually measuring hardware performance counters here. I'm just looking to learn how to reasonably sight-read ASM and estimate cycles.

Recommended Answer

As you already pointed out, the theoretical throughput and latency of a MOVSS instruction is 1 cycle. You were looking at the right document (the Intel Optimization Manual). Agner Fog (mentioned in the comments) measured the same numbers in his Instruction Tables for Intel CPUs (AMD has higher latencies).

This leads us to the first problem: what specific microarchitecture are you investigating? This can make a big difference, even for the same vendor. Agner Fog reports that MOVSS has a 2-6 cycle latency on AMD Bulldozer, depending on the source and destination (register vs. memory). This is important to keep in mind when looking into the performance of computer architectures.

The 200 cycles are most likely cache misses, as already pointed out by dwelch in the comments. The numbers you get from the Optimization Manual for any memory-accessing instruction all assume that the data resides in the first-level cache (L1). If no previous instruction has touched the data, the cache line (64 bytes on Intel and AMD x86) must be loaded from memory into the last-level cache, from there into the second-level cache, then into L1, and finally into the XMM register (within 1 cycle). Transfers between L3-L2 and L2-L1 have a throughput (not latency!) of two cycles per cache line on current Intel microarchitectures. The memory bandwidth can be used to estimate the throughput between L3 and memory (e.g., a 2 GHz CPU with an achievable memory bandwidth of 40 GB/s has a throughput of 3.2 cycles per cache line). Cache lines (or memory blocks) are typically the smallest unit that caches and memory can operate on; their size differs between microarchitectures and may even differ within one architecture, depending on the cache level (L1, L2, and so on).
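The bandwidth arithmetic above can be checked directly. A minimal sketch using the example figures from the answer (2 GHz clock, 40 GB/s achievable bandwidth, 64-byte cache lines):

```python
# Estimating cache-line transfer throughput from memory bandwidth,
# using the example figures from the answer above.
CPU_FREQ_HZ = 2e9            # 2 GHz CPU clock
MEM_BW_BYTES_PER_SEC = 40e9  # 40 GB/s achievable memory bandwidth
CACHE_LINE_BYTES = 64        # cache-line size on Intel/AMD x86

bytes_per_cycle = MEM_BW_BYTES_PER_SEC / CPU_FREQ_HZ  # 20 bytes per cycle
cycles_per_line = CACHE_LINE_BYTES / bytes_per_cycle  # 3.2 cycles per line
print(f"{cycles_per_line} cycles per cache line")     # 3.2 cycles per cache line
```

The same arithmetic works for any clock/bandwidth pair; only the constants change.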

Now this is all throughput, not latency, which will not help you estimate what you described above. To verify it, you would need to execute the instructions over and over (for at least 1/10 s) to get cycle-accurate measurements. By changing the instructions you can decide whether to measure latency (by including dependencies between instructions) or throughput (by making each instruction's input independent of the results of previous instructions). To measure caches and memory accesses you would need to predict whether an access hits a cache, which can be done using layer conditions.

A tool to estimate instruction execution (both latency and throughput) for Intel CPUs is the Intel Architecture Code Analyzer, which supports multiple microarchitectures up to Haswell. Its latency predictions should be taken with a grain of salt, since it is much harder to estimate latency than throughput.
