Slow jmp-instruction


Question

As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. by Agner Fog), but I'm doing it for fun and self-education.

My testing code is pretty simple (shown here as pseudocode for simplicity; in reality it's in assembler):

for(outer_loop=0; outer_loop<NO;outer_loop++){
    operation  #first
    operation  #second
    ...
    operation #NI-th
} 
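
For concreteness, here is a minimal sketch of how such an unrolled test body could be generated as GNU assembler source. This is only an illustration, not the actual downloadable script; the file name and the NO/NI values are made up:

# generate_body.py - illustrative sketch only, not the actual test generator.
# Emits NI copies of one operation inside an outer loop, in AT&T syntax.
NO = 10_000_000   # outer-loop iterations (made-up value)
NI = 1000         # unrolled operations per iteration (made-up value)

with open("test_body.s", "w") as f:
    f.write(".globl test_loop\ntest_loop:\n")
    f.write(f"    mov ${NO}, %ecx\n")        # outer_loop counter
    f.write(".Louter:\n")
    f.write("    xorl %eax, %eax\n" * NI)    # the operation under test, NI times
    f.write("    dec %ecx\n")
    f.write("    jnz .Louter\n")
    f.write("    ret\n")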

But still, some things should be considered.

  1. If the inner part of the loop is large (NI > 10^7), the whole content of the loop does not fit into the instruction cache and thus must be loaded over and over again, making the speed of RAM define the time needed for execution. For example, for large inner parts, xorl %eax, %eax (2 bytes) is 33% faster than xorq %rax, %rax (3 bytes); see the quick calculation after this list.
  2. If NI is small and the whole loop fits easily into the instruction cache, then xorl %eax, %eax and xorq %rax, %rax are equally fast and can be executed 4 times per clock cycle.
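
A quick sanity check on that 33% figure, assuming execution time is proportional to the instruction bytes streamed from RAM:

# If instruction-fetch bandwidth is the limit, time scales with code size:
xorl_bytes, xorq_bytes = 2, 3
print(1 - xorl_bytes / xorq_bytes)   # 0.33... -> the 2-byte xorl takes ~33% less time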

However, this simple model does not hold water for the jmp instruction. For the jmp instruction, my test code looks as follows:

for(outer_loop=0; outer_loop<NO;outer_loop++){
    jmp .L0
    .L0: jmp .L1
    .L1: jmp .L2
    ....
}
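
Since each jmp needs its own label, the unrolled chain lends itself to generation by a script. A hypothetical sketch (label scheme and file name invented for illustration):

# Sketch: emit a chain of NI unconditional jumps, each one targeting the
# label that immediately follows it, as in the pseudocode above.
NI = 1000
with open("jmp_body.s", "w") as f:
    for i in range(NI):
        f.write(f"    jmp .L{i}\n")   # assembles to the 2-byte eb 00 encoding
        f.write(f".L{i}:\n")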

The results:

  1. For "large" loop sizes (already for NI > 10^4) I measure 4.2 ns per jmp instruction (which would equate to 42 bytes loaded from RAM, or ca. 12 clock cycles on my machine).
  2. For small loop sizes (NI < 10^3) I measure 1 ns per jmp instruction (around 3 clock cycles, which sounds plausible - Agner Fog's tables show a cost of 2 clock cycles).

The instruction jmp LX uses the 2-byte encoding eb 00 (eb is the opcode for jmp rel8; the displacement 00 means the target is the immediately following instruction).

Thus, my question: what is the explanation for the high cost of the jmp instruction in "large" loops?

PS: If you want to try it out on your own machine, you can download the scripts from here; just run sh jmp_test.sh in the src folder.

Experimental results confirming Peter's BTB size theory.

The following table shows cycles per instruction for different NI values (relative to NI = 1000):

|operations / NI      | 1000 |  2000|  3000|  4000|  5000| 10000|
|---------------------|------|------|------|------|------|------|
|jmp                  |  1.0 |  1.0 |  1.0 |  1.2 |  1.9 |   3.8|
|jmp+xor              |  1.0 |  1.2 |  1.3 |  1.6 |  2.8 |   5.3|
|jmp+cmp+je (jump)    |  1.0 |  1.5 |  4.0 |  4.4 |  5.5 |   5.5|
|jmp+cmp+je (no jump) |  1.0 |  1.2 |  1.3 |  1.5 |  3.8 |   7.6|

One can see:

  1. For the jmp instruction, a (yet unknown) resource becomes scarce, and this leads to a performance degradation for NI larger than about 4000.
  2. This resource is not shared with instructions such as xor - the performance degradation still kicks in at NI of about 4000 if jmp and xor are executed after each other.
  3. But this resource is shared with je if the jump is taken - for jmp+je after each other, the resource becomes scarce at NI of about 2000.
  4. However, if je does not jump at all, the resource becomes scarce again only at NI of about 4000 (4th row of the table).

Matt Godbolt's branch-prediction reverse engineering article establishes that the branch target buffer has a capacity of 4096 entries. That is very strong evidence that BTB misses are the reason for the observed throughput difference between small and large jmp loops.

Answer

TL:DR: my current guess is running out of BTB (branch target buffer) entries. Pipelined code fetch needs to predict the existence of an unconditional branch before it's even decoded. See below.

2021 update: https://blog.cloudflare.com/branch-predictor/ explores this in detail, using a block of jmp next_insn as an experiment. Branch density and aliasing (the same offset relative to a 64-byte line), for example, can matter.

Even though your jmps are no-ops, the CPU doesn't have extra transistors to detect this special case. They're handled just like any other jmp, which means having to re-start instruction fetch from a new location, creating a bubble in the pipeline.

To learn more about jumps and their effect on pipelined CPUs, Control Hazards in a classic RISC pipeline should be a good intro to why branches are difficult for pipelined CPUs. Agner Fog's guides explain the practical implications, but I think they assume some of that kind of background knowledge.

Your Intel Broadwell CPU has a uop-cache, which caches decoded instructions (separate from the 32kiB L1 I-cache).

The uop cache size is 32 sets of 8 ways, with 6 uops per line, for a total of 1536 uops (if every line is packed with 6 uops - perfect efficiency). 1536 uops is between your 1000 and 10000 test sizes. Before your edit, I predicted that the cutoff from slow to fast would be right around 1536 total instructions in your loop. But it doesn't slow down at all until well beyond 1536 instructions, so I think we can rule out uop-cache effects. This isn't as simple a question as I thought. :)

Running from the uop-cache (small code size) instead of the x86 instruction decoders (large loops) means that there are fewer pipeline stages before the stage that recognizes jmp instructions. So we might expect the bubbles from a constant stream of jumps to be smaller, even though they're predicted correctly.

Running from the decoders is supposed to give a larger branch mispredict penalty (like maybe 20 cycles instead of 15), but these aren't mispredicted branches.

Even though the CPU doesn't need to predict whether the branch is taken or not, it still uses branch-prediction resources to predict that a block of code contains a taken branch before it's decoded.

Caching the fact that there is a branch in a certain block of code, and its target address, allows the frontend to start fetching code from the branch target before the jmp rel32 encoding is actually decoded. Remember that decoding variable-length x86 instructions is hard: you don't know where one instruction starts until the previous one is decoded. So you can't just pattern-match the instruction stream looking for unconditional jumps / calls as soon as it's fetched.

My current theory is that the slowdown comes from running out of branch-target-buffer entries.

See also What branch misprediction does the Branch Target Buffer detect?, which has a good answer, and the discussion in this Realworldtech thread.

One very important point: the BTB predicts in terms of which block to fetch next, rather than the exact destination of a specific branch within a fetch block. So instead of having to predict targets for all branches in a fetch block, the CPU just needs to predict the address of the next fetch.
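
As a toy illustration of running out of entries (a much-simplified model of my own, not Intel's documented design), consider a direct-mapped, 4096-entry BTB keyed by branch address. Once the unrolled chain contains more jmps than there are entries, conflicting branches evict each other on every pass and the steady-state hit rate collapses:

# Toy model only: direct-mapped BTB, one entry per branch address.
BTB_ENTRIES = 4096   # capacity suggested by Matt Godbolt's experiments

def steady_state_hit_rate(ni, jmp_size=2):
    btb = {}                  # index -> branch address currently cached
    hits = 0
    for _ in range(2):        # one warm-up pass, then one measured pass
        hits = 0
        for i in range(ni):
            addr = i * jmp_size                      # address of the i-th jmp
            idx = (addr // jmp_size) % BTB_ENTRIES   # direct-mapped index
            if btb.get(idx) == addr:
                hits += 1
            btb[idx] = addr
    return hits / ni

for ni in (1000, 4000, 5000, 10000):
    print(ni, round(steady_state_hit_rate(ni), 2))   # 1.0, 1.0, 0.64, 0.0

The real predictor is set-associative and indexed by fetch address rather than a simple direct-mapped table, so the measured degradation is gradual rather than a cliff, but the 4096-entry capacity lines up with where the slowdown begins in the table above.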

Yes, memory bandwidth can be a bottleneck when running very high throughput stuff like xor-zeroing, but you're hitting a different bottleneck with jmp. The CPU would have time to fetch 42B from memory, but that's not what it's doing. Prefetch can easily keep up with 2 bytes per 3 clocks, so there should be near-zero L1 I-cache misses.

In your xor with/without REX test, main memory bandwidth might actually have been the bottleneck if you tested with a loop large enough not to fit in L3 cache. That consumes 4 * 2B per cycle on a ~3GHz CPU, which just about maxes out the 25GB/s of DDR3-1600MHz. Even the L3 cache would be fast enough to keep up with 4 * 3B per cycle, though.
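
A back-of-the-envelope check of those numbers (the ~3GHz clock and a dual-channel DDR3-1600 setup are assumptions):

# 4 instructions per cycle, streamed as code bytes from memory:
clock_hz = 3.0e9
print(4 * 2 * clock_hz / 1e9)   # 2-byte xorl stream: 24.0 GB/s
print(4 * 3 * clock_hz / 1e9)   # 3-byte xorq stream: 36.0 GB/s
print(1600e6 * 8 * 2 / 1e9)     # dual-channel DDR3-1600: 25.6 GB/s

So the 2-byte stream just barely fits within DRAM bandwidth, while the 3-byte stream would need 36 GB/s, which only the cache hierarchy can supply.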

It's interesting that main memory BW is the bottleneck; I initially guessed that decode (in blocks of 16 bytes) would be the bottleneck for 3-byte XORs, but I guess they're small enough.

Also note that it's a lot more normal to measure times in core clock cycles. However, your measurements in ns are useful when you're looking at memory, I guess, because low clock speeds for power-saving change the ratio of core clock speed to memory speed. (i.e. memory bottlenecks are less of a problem at minimum CPU clock speed.)

To benchmark in clock cycles, use perf stat ./a.out. There are other useful performance counters, too, which are essential for trying to understand performance characteristics.

See x86-64 Relative jmp performance for perf-counter results from Core2 (8 cycles per jmp), and some unknown microarchitecture where it's ~10c per jmp.

The details of modern CPU performance characteristics are hard enough to understand even under more or less white-box conditions (reading Intel's optimization manual, and what they've published regarding CPU internals). You're going to get stuck early and often if you insist on black-box testing and don't read things like arstechnica articles about new CPU designs, or more detailed material like David Kanter's Haswell microarch overview, or the similar Sandybridge writeup I linked earlier.

If getting stuck early and often is OK and you're having fun, then by all means keep doing what you're doing. But it makes it harder for people to answer your questions if you don't know those details, like in this case. :/ E.g. the first version of this answer assumed you had read enough to know what the uop cache was.
