Slow jmp-instruction

Problem description

As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. by Agner Fog), but I'm doing it for fun and self-education.

My testing code is pretty simple (shown here as pseudocode for simplicity; in reality it's in assembler):

for(outer_loop=0; outer_loop<NO; outer_loop++){
    operation  #first
    operation  #second
    ...
    operation  #NI-th
}
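
For concreteness, the real inner loop might look roughly like this in GAS/AT&T syntax (a sketch of my own, not the author's actual harness; the counter register, the label names and the NO/NI values are assumptions):

# Sketch of the timing loop: NO outer iterations, each running
# NI assembler-pasted copies of the instruction under test.
    movq    $100000, %rcx      # outer-loop counter NO (assumed value)
.Louter:
    .rept   1000               # NI copies of the tested instruction
    xorl    %eax, %eax
    .endr
    decq    %rcx
    jnz     .Louter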

Yet some things should be considered.

  1. If the inner part of the loop is large (NI>10^7), the whole content of the loop does not fit into the instruction cache and thus must be loaded over and over again, making the speed of RAM define the time needed for execution. For example, for large inner parts, xorl %eax, %eax (2 bytes) is 33% faster than xorq %rax, %rax (3 bytes; see the byte encodings below).
  2. If NI is small and the whole loop fits easily into the instruction cache, then xorl %eax, %eax and xorq %rax, %rax are equally fast and can be executed 4 times per clock cycle.
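
For reference, the two zeroing idioms differ only by a REX.W prefix; the byte sequences below are the standard x86-64 encodings:

    xorl    %eax, %eax    # 31 c0    (2 bytes; writing %eax zero-extends into %rax anyway)
    xorq    %rax, %rax    # 48 31 c0 (3 bytes: REX.W prefix plus the same opcode)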

However, this simple model does not hold water for the jmp instruction. For the jmp instruction my test code looks as follows:

for(outer_loop=0; outer_loop<NO; outer_loop++){
    jmp .L0
    .L0: jmp .L1
    .L1: jmp .L2
    ...
}
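
In real assembler such a chain of jumps can be generated mechanically; here is a minimal GAS sketch of my own (the NO/NI values and the counter register are assumptions, not the author's script):

# Each 'jmp 0f' targets the local label on the very next line, so the
# assembler should emit the 2-byte short form 'eb 00' for every jump.
    movq    $100000, %rcx      # outer-loop counter NO (assumed value)
.Louter:
    .rept   1000               # NI chained jumps
    jmp     0f                 # forward reference to the next '0:' label
0:
    .endr
    decq    %rcx
    jnz     .Louter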

The results are:

  1. For "large" loop sizes (already for NI>10^4) I measure 4.2 ns per jmp instruction (which would equate to 42 bytes loaded from RAM, or ca. 12 clock cycles on my machine).
  2. For small loop sizes (NI<10^3) I measure 1 ns per jmp instruction (around 3 clock cycles, which sounds plausible - Agner Fog's tables show a cost of 2 clock cycles).

The instruction jmp .LX uses the 2-byte eb 00 encoding.

Thus, my question: what could explain the high cost of the jmp instruction in the "large" loops?

PS: If you would like to try it out on your machine, you can download the scripts from here; just run sh jmp_test.sh in the src folder.

Experimental results confirming Peter's BTB size theory.

The following table shows cycles per instruction for different NI values (relative to NI=1000):

|operations / NI      | 1000 | 2000 | 3000 | 4000 | 5000 | 10000 |
|---------------------|------|------|------|------|------|-------|
|jmp                  |  1.0 |  1.0 |  1.0 |  1.2 |  1.9 |   3.8 |
|jmp+xor              |  1.0 |  1.2 |  1.3 |  1.6 |  2.8 |   5.3 |
|jmp+cmp+je (jump)    |  1.0 |  1.5 |  4.0 |  4.4 |  5.5 |   5.5 |
|jmp+cmp+je (no jump) |  1.0 |  1.2 |  1.3 |  1.5 |  3.8 |   7.6 |

One can see that:

  1. For the jmp instruction, a (yet unknown) resource becomes scarce, and this leads to a performance degradation for NI larger than about 4000.
  2. This resource is not shared with instructions such as xor - the performance degradation still kicks in at NI of about 4000 when jmp and xor alternate.
  3. But this resource is shared with je if the jump is taken - for alternating jmp+je, the resource becomes scarce at NI of about 2000.
  4. However, if je does not jump at all, the resource becomes scarce again only at NI of about 4000 (4th row).

Matt Godbolt's branch-prediction reverse-engineering article establishes that the branch-target-buffer capacity is 4096 entries. That is very strong evidence that BTB misses are the reason for the observed throughput difference between small and large jmp loops.

Answer

TL;DR: my current guess is running out of BTB (branch-target buffer) entries. See below.

Even though your jmps are no-ops, the CPU doesn't have extra transistors to detect this special case. They're handled just like any other jmp, which means having to restart instruction fetch from a new location, creating a bubble in the pipeline.

To learn more about jumps and their effect on pipelined CPUs, Control Hazards in a classic RISC pipeline should be a good intro to why branches are difficult for pipelined CPUs. Agner Fog's guides explain the practical implications, but I think they assume some background knowledge of that kind.

Your Intel Broadwell CPU has a uop-cache, which caches decoded instructions (separate from the 32kiB L1 I-cache).

The uop cache size is 32 sets of 8 ways, with 6 uops per line, for a total of 1536 uops (if every line is packed with 6 uops - perfect efficiency). 1536 uops is between your 1000 and 10000 test sizes. Before your edit, I predicted that the fast-to-slow cutoff would be right around 1536 total instructions in your loop. It doesn't slow down at all until well beyond 1536 instructions, so I think we can rule out uop-cache effects. This isn't as simple a question as I thought. :)

Running from the uop-cache (small code size) instead of the x86 instruction decoders (large loops) means that there are fewer pipeline stages before the stage that recognizes jmp instructions. So we might expect the bubbles from a constant stream of jumps to be smaller, even though they're predicted correctly.

Running from the decoders is supposed to give a larger branch mispredict penalty (like maybe 20 cycles instead of 15), but these aren't mispredicted branches.

Even though the CPU doesn't need to predict whether the branch is taken or not, it might still use branch-prediction resources to predict that a block of code contains a taken branch before it's decoded.

Caching the fact that there is a branch in a certain block of code, and its target address, allows the frontend to start fetching code from the branch target before the jmp rel32 encoding is actually decoded. Remember that decoding variable-length x86 instructions is hard: you don't know where one instruction starts until the previous one is decoded. So you can't just pattern-match the instruction stream looking for unconditional jumps/calls as soon as it's fetched.
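
To illustrate that last point (my example, not from the original answer): the byte pattern of a short jump can appear as data inside another instruction, so a raw byte scan cannot tell them apart:

    movl    $0xeb, %eax    # encodes as b8 eb 00 00 00 - bytes 2 and 3 are
                           # 'eb 00', identical to a short jmp, but here
                           # they are immediate data, not an instruction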

My current theory is that the slowdown comes from running out of branch-target-buffer entries.

See also What branch misprediction does the Branch Target Buffer detect? which has a nice answer, and discussion in this Realworldtech thread.

One very important point: the BTB predicts in terms of which block to fetch next, rather than the exact destination of a specific branch within a fetch block. So instead of having to predict targets for all branches in a fetch block, the CPU just needs to predict the address of the next fetch.

Yes, memory bandwidth can be a bottleneck when running very high throughput stuff like xor-zeroing, but you're hitting a different bottleneck with jmp. The CPU would have time to fetch 42B from memory, but that's not what it's doing. Prefetch can easily keep up with 2 bytes per 3 clocks, so there should be near-zero L1 I-cache misses.

In your xor with/without REX test, main-memory bandwidth might actually have been the bottleneck if you tested with a large enough loop not to fit in L3 cache. That consumes 4 * 2 B per cycle on a ~3GHz CPU, i.e. roughly 24 GB/s, which just about maxes out the 25 GB/s of DDR3-1600MHz. Even the L3 cache would be fast enough to keep up with 4 * 3 B per cycle, though.

It's interesting that main-memory bandwidth is the bottleneck; I initially guessed that decode (in blocks of 16 bytes) would be the bottleneck for 3-byte XORs, but I guess they're small enough.

Also note that it's a lot more normal to measure times in core clock cycles. However, your measurements in ns are useful when you're looking at memory, I guess, because low clock speeds for power saving change the ratio of core clock speed to memory speed. (i.e. memory bottlenecks are less of a problem at minimum CPU clock speed.)

For benchmarking in clock cycles, use perf stat ./a.out. There are other useful performance counters, too, which are essential for trying to understand the performance characteristics.
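
For example, one plausible invocation asks for the branch-related counters directly (cycles, instructions, branches and branch-misses are standard Linux perf event names; ./a.out stands in for the test binary):

    perf stat -e cycles,instructions,branches,branch-misses ./a.out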

See x86-64 Relative jmp performance for perf-counter results from Core2 (8 cycles per jmp), and some unknown microarchitecture where it's ~10c per jmp.

The details of modern CPU performance characteristics are hard enough to understand even under more or less white-box conditions (reading Intel's optimization manual, and what they've published regarding CPU internals). You're going to get stuck early and often if you insist on black-box testing and don't read things like arstechnica articles about new CPU designs, or more detailed material like David Kanter's Haswell microarch overview, or the similar Sandybridge writeup I linked earlier.

If getting stuck early and often is OK and you're having fun, then by all means keep doing what you're doing. But it makes it harder for people to answer your questions if you don't know those details, like in this case. :/ For example, the first version of this answer assumed you had read enough to know what the uop cache was.
