Assembly - How to score a CPU instruction by latency and throughput

Problem description

I'm looking for a type of formula / way to measure how fast an instruction is, or more specifically, to give each instruction a "score" in CPU cycles.

Let's take the following assembly program as an example:

nop                     
mov         eax,dword ptr [rbp+34h] 
inc         eax     
mov         dword ptr [rbp+34h],eax  

and the following Intel Skylake information:

mov r,m : Throughput=0.5 Latency=2

mov m,r : Throughput=1 Latency=2

nop : Throughput=0.25 Latency=none

inc : Throughput=0.25 Latency=1

I know that the order of the instructions in the program matters here, but I'm looking to create something general that doesn't need to be "accurate to the single cycle".

Does anyone have any idea how I can do that?

Thanks a lot.

Solution

There is no formula you can apply; you have to measure.

The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:

  • Sandybridge 1c / 5c throughput/latency.
  • HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
  • SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.

There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)


You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
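
For instance, a minimal sketch of the fully-unrolled variant in NASM syntax (the instruction and repeat count here are placeholders, not from the original answer):

 %rep 100000             ; source-level repeat: 100k copies, no runtime branch
     add eax, 1          ; placeholder instruction under test
 %endrep
 ; a block this large can't fit in the uop cache, so you're also timing the
 ; legacy decoders - fine for some experiments, a confound for others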

Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally.

http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.


To measure latency yourself, you make the output of each instruction an input for the next.

 mov  ecx, 10000000
 inc_latency:
     inc eax
     inc eax
     inc eax
     inc eax
     inc eax
     inc eax
     inc eax            ; 7 dependent incs, matching the 7 * inc_latency math below

     sub ecx,1          ; avoid partial-flag false dep for P4
     jnz inc_latency    ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family

This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.

I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall instruction), and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example).
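
Roughly, such a harness looks like this (a minimal sketch, assuming NASM syntax and the x86-64 Linux syscall ABI; the file name and counts are made up):

 ; testloop.asm - build with:  nasm -felf64 testloop.asm && ld -o testloop testloop.o
 global _start
 _start:
     mov  ecx, 10000000
 .loop:
     inc  eax               ; dependency chain under test goes here
     inc  eax
     sub  ecx, 1
     jnz  .loop

     xor  edi, edi          ; exit status = 0
     mov  eax, 60           ; __NR_exit on x86-64 Linux
     syscall                ; sys_exit(0): no libc startup/shutdown to pollute the timing

perf stat ./testloop then reports task-clock, cycles, and instructions for the whole process; divide the cycle count by the iteration count to get cycles per iteration.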

Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.


To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiple uops.

How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)

For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
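
As a concrete sketch (assuming NASM; the %assign / %rep / %+ idiom just writes out mulps xmm0,xmm0 through mulps xmm15,xmm15), 16 self-multiplying chains are more than enough to cover 4-5c latency at 0.5c throughput:

 ; zero all 16 regs first so the inputs are 0.0 (no subnormal penalties)
 %assign i 0
 %rep 16
     xorps xmm %+ i, xmm %+ i
 %assign i i+1
 %endrep
     mov  ecx, 10000000
 mulps_tput:
 %assign i 0
 %rep 16
     mulps xmm %+ i, xmm %+ i   ; 16 independent dep chains, one per register
 %assign i i+1
 %endrep
     sub  ecx, 1
     jnz  mulps_tput

At 0.5c throughput you'd expect 16 mulps per 8 cycles, so about 8 cycles per iteration; each chain only advances once per iteration, which easily covers its 4-5 cycle latency.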

More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:

 mov  ecx, 10000000
 inc_latency:
   %rep 10          ;; source-level repeat of a block, no runtime branching
     inc eax
     inc ebx
     ; not ecx, we're using it as a loop counter
     inc edx
     inc esi
     inc edi
     inc ebp
     inc r8d
     inc r9d
     inc r10d
     inc r11d
     inc r12d
     inc r13d
     inc r14d
     inc r15d
   %endrep

     sub ecx,1          ; break partial-flag false dep for P4
     jnz inc_latency    ; dec/jnz macro-fuses into 1 uop on Intel SnB-family


If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)

There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).

I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
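
A sketch of that experiment (assuming NASM; the loop counter uses esi because cl holds the shift count):

 mov  ecx, 1            ; shift count in cl
 mov  esi, 10000000     ; loop counter; ecx is taken by the shift count
 shl_latency:
     shl  edx, cl       ; the link under test: loop-carried chain through edx
     add  edx, 1        ; four known-1c links in the same chain
     add  edx, 1
     add  edx, 1
     add  edx, 1
     sub  esi, 1
     jnz  shl_latency

Time it once as shown and once with the shl deleted (baseline: ~4 cycles per iteration from the four adds); the extra cycles per iteration, ~1.23c on SKL, are the shl's incremental latency cost.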


Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)


The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c latency if you measure it by putting that in a loop.

Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)
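
A sketch of that pointer-chasing measurement (assuming NASM and a position-dependent build; a qword holding its own address makes every load's address depend on the previous load's result):

 section .data
 self_ptr: dq self_ptr      ; 8-byte slot that points at itself

 section .text
     mov  rax, self_ptr     ; start the chain
     mov  ecx, 10000000
 chase:
     mov  rax, [rax]        ; each load's address comes from the previous
     mov  rax, [rax]        ; load's result, so iterations can't overlap
     mov  rax, [rax]
     mov  rax, [rax]
     sub  ecx, 1
     jnz  chase

With 4 dependent loads per iteration, cycles per iteration divided by 4 gives the load-use latency: expect ~4c on SKL for this pure pointer-chasing case, not the 2c from the table.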
