Assembly - How to score a CPU instruction by latency and throughput


Problem description

I'm looking for a formula, or a way, to measure how fast an instruction is; more specifically, to give each instruction a "score" in CPU cycles.

Let's take the following assembly program as an example:

nop                     
mov         eax,dword ptr [rbp+34h] 
inc         eax     
mov         dword ptr [rbp+34h],eax  

and the following Intel Skylake information:

mov r,m : Throughput=0.5  Latency=2
mov m,r : Throughput=1    Latency=2
nop     : Throughput=0.25 Latency=none
inc     : Throughput=0.25 Latency=1

I know that the order of the instructions in the program matters here, but I'm looking to create something general that doesn't need to be "accurate to the single cycle".

Does anyone have any idea how I can do that?

Answer

There's no formula you can apply; you have to measure.

The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:

  • Sandybridge 1c / 5c throughput/latency.
  • HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
  • SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.

There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)

You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)

https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.

https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for the two halves of a round-trip like movd xmm0,eax and back; instead they show you the range of possible latencies, assuming the rest of the chain was the minimum plausible.

Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.

http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.

These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.

To measure latency yourself, you make the output of each instruction an input for the next:

 mov  ecx, 10000000
 inc_latency:
     inc eax
     inc eax
     inc eax
     inc eax
     inc eax
     inc eax
     inc eax

     sub ecx,1          ; avoid partial-flag false dep for P4
     jnz inc_latency    ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family

This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.

I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall instruction), and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example.)
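
For reference, here's a minimal skeleton of such a static executable (the file name and build commands are my own illustration; 60 is __NR_exit on x86-64 Linux):

 ; testloop.asm:  nasm -felf64 testloop.asm && ld -o testloop testloop.o
 global _start
 _start:
     mov  ecx, 10000000
 test_loop:
     ; ... the instruction sequence under test, e.g. the inc chain above ...
     sub  ecx, 1
     jnz  test_loop

     xor  edi, edi          ; exit status = 0
     mov  eax, 60           ; __NR_exit
     syscall

perf stat ./testloop then reports wall time plus core cycle and instruction counts for the whole run, and the 10M iterations dwarf the process startup/exit overhead.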

Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.

To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiple uops.
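
For example, here's a hypothetical throughput test for imul (my illustration, not from the original answer): short chains, each cut off by a zeroing idiom so out-of-order exec can overlap chains from different iterations:

 mov  ecx, 10000000
 imul_tput:
     imul eax, eax      ; 3c latency / 1c throughput on SnB-family, port 1
     imul eax, eax
     imul eax, eax
     imul eax, eax
     xor  eax, eax      ; zeroing idiom: breaks the chain, needs no execution unit
     sub  ecx, 1
     jnz  imul_tput

If this runs at about 4 cycles per iteration, the four imuls are hitting their 1/clock throughput despite the 3-cycle latency, because chains from different iterations overlap in the out-of-order window.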

How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)

For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
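
As a rule of thumb (my paraphrase of the reasoning above, not a formula from the answer):

 chains needed to saturate ≈ latency / reciprocal throughput
 e.g. Haswell FP multiply: 5 cycles / 0.25 cycles per instruction = 20 in flight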

More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:

 mov  ecx, 10000000
 inc_latency:
   %rep 10          ;; source-level repeat of a block, no runtime branching
     inc eax
     inc ebx
     ; not ecx, we're using it as a loop counter
     inc edx
     inc esi
     inc edi
     inc ebp
     inc r8d
     inc r9d
     inc r10d
     inc r11d
     inc r12d
     inc r13d
     inc r14d
     inc r15d
   %endrep

     sub ecx,1          ; break partial-flag false dep for P4
     jnz inc_latency    ; dec/jnz macro-fuses into 1 uop on Intel SnB-family


If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
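
As an illustration of that idea (my own variation, not code from the answer): swap a zeroing idiom into one chain inside the repeated block, so there's a full flag write mid-block as well as a broken register dependency:

   %rep 10
     inc  eax
     xor  ebx, ebx     ; zeroing idiom: breaks EBX's chain and writes CF/OF/SF/ZF/PF
     inc  ebx
     inc  edx
     ; ... the rest of the registers as in the loop above ...
   %endrep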

There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).

I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
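
Roughly what that experiment looks like (my reconstruction from the description; the exact code is in the linked blog thread). The four adds serialize through EDX and cost exactly 4 cycles, so anything beyond that per iteration is attributable to the shl:

 mov  esi, 10000000     ; loop counter in ESI, since CL holds the shift count
 mov  cl, 1
 shl_lat:
     shl  edx, cl       ; variable-count shift: extra flag-merge uops on SnB-family
     add  edx, 1
     add  edx, 1
     add  edx, 1
     add  edx, 1
     sub  esi, 1
     jnz  shl_lat

If it reproduces the measurement quoted above, this runs at about 4 + 1.23 ≈ 5.2 cycles per iteration on SKL.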

Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)

The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c latency if you measure it by putting that in a loop.
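
A minimal pointer-chasing sketch to see that yourself (my own illustration): make a qword point to itself, so each load's address depends on the previous load's result:

     mov  rax, rsp          ; use a stack slot as the pointer
     mov  [rax], rax        ; the qword at [rsp] now points to itself
     mov  ecx, 10000000
 chase:
     mov  rax, [rax]        ; pure load-use latency: measures 4c on SKL, not 2c
     mov  rax, [rax]
     mov  rax, [rax]
     mov  rax, [rax]
     sub  ecx, 1
     jnz  chase

Cycles per iteration divided by 4 gives the load-use latency for this simple [reg] addressing mode.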

Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)
