How are x86 uops scheduled, exactly?

Question

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops1) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions.

For example, take the following toy loop2:

top:
lea eax, [ecx + 5]
popcnt eax, eax
add edi, eax
dec ecx
jnz top

This basically implements the following loop (with the correspondence edi -> total, ecx -> c):

do {
  total += popcnt(c + 5);
} while (--c > 0);

I'm familiar with the process of optimizing any small loop by looking at the uop breakdown, dependency chain latencies and so on. In the loop above we have only one loop-carried dependency chain: dec ecx. The first three instructions of the loop (lea, popcnt, add) are part of a dependency chain that starts fresh each iteration.

The final dec and jnz are macro-fused. So we have a total of 4 fused-domain uops, and only one loop-carried dependency chain, with a latency of 1 cycle. So based on those criteria, it seems that the loop can execute at 1 cycle/iteration.

However, we should look at the port pressure too:

  • The lea can execute on ports 1 and 5
  • The popcnt can execute on port 1
  • The add can execute on ports 0, 1, 5 and 6
  • The predicted-taken jnz executes on port 6

So to get to 1 cycle / iteration, you pretty much need the following to happen:

  • The popcnt must execute on port 1 (the only port it can execute on)
  • The lea must execute on port 5 (and never on port 1)
  • The add must execute on port 0, and never on any of the other three ports it can execute on
  • The jnz can only execute on port 6 anyway

That's a lot of conditions! If instructions just got scheduled randomly, you could get a much worse throughput. For example, 75% of the time the add would go to port 1, 5 or 6, which would delay the popcnt, lea or jnz by one cycle. Similarly for the lea, which can go to 2 ports, one shared with the popcnt.
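
To put a rough number on "much worse", here is a toy simulation - entirely my own sketch, not a hardware model: each of the four relevant ports retires at most one uop per cycle, dependencies are ignored, and every uop is bound to a uniformly random eligible port the moment it issues. Under those assumptions the toy loop above settles at about 1.75 cycles/iteration, since port 1 picks up the popcnt plus half of the leas plus a quarter of the adds:

#include <stdio.h>
#include <stdlib.h>

enum { P0, P1, P5, P6, NPORTS };   /* model only the 4 relevant ports */

int main(void) {
    const long iters = 1000000;
    long queued[NPORTS] = {0}, cycles = 0, backlog;

    srand(1);
    for (long i = 0; i < iters; i++) {      /* issue one iteration per cycle */
        queued[rand() & 1 ? P1 : P5]++;     /* lea: p1 or p5, 50/50          */
        queued[P1]++;                       /* popcnt: p1 only               */
        queued[rand() & 3]++;               /* add: p0/p1/p5/p6, 25% each    */
        queued[P6]++;                       /* predicted-taken jnz: p6 only  */
        for (int p = 0; p < NPORTS; p++)    /* each port retires one uop     */
            if (queued[p]) queued[p]--;
        cycles++;
    }
    do {                                    /* drain whatever piled up       */
        backlog = 0;
        for (int p = 0; p < NPORTS; p++) {
            if (queued[p]) queued[p]--;
            backlog += queued[p];
        }
        cycles++;
    } while (backlog);
    printf("%.2f cycles/iteration\n", (double)cycles / iters);   /* ~1.75 */
}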

IACA on the other hand reports a result very close to optimal, 1.05 cycles per iteration:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - l.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.05 Cycles       Throughput Bottleneck: FrontEnd, Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  | 0.9  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea eax, ptr [ecx+0x5]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | popcnt eax, eax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add edi, eax
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | dec ecx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff4

It pretty much reflects the necessary "ideal" scheduling I mentioned above, with a small deviation: it shows the add stealing port 5 from the lea on 1 out of 10 cycles. It also doesn't know that the fused branch is going to go to port 6 since it is predicted taken, so it puts most of the uops for the branch on port 0, and most of the uops for the add on port 6, rather than the other way around.

It's not clear if the extra 0.05 cycles that IACA reports over the optimal is the result of some deep, accurate analysis, or a less insightful consequence of the algorithm it uses, e.g., analyzing the loop over a fixed number of cycles, or just a bug or whatever. The same goes for the 0.1 fraction of a uop that it thinks will go to the non-ideal port. It is also not clear if one explains the other - I would think that mis-assigning a port 1 out of 10 times would cause a cycle count of 11/10 = 1.1 cycles per iteration, but I haven't worked out the actual downstream results - maybe the impact is less on average. Or it could just be rounding (0.05 == 0.1 to 1 decimal place).
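
As a quick check on that 11/10 intuition, the same kind of toy queue model as above - again my own sketch, with dependencies ignored, so any downstream absorption of the conflict is not modeled - using ideal port binding except for the add stealing p5 one time in ten does land on ~1.10 cycles/iteration, which would make IACA's 1.05 look like rounding, or slack in whatever model it uses:

#include <stdio.h>
#include <stdlib.h>

enum { P0, P1, P5, P6, NPORTS };

int main(void) {
    const long iters = 1000000;
    long queued[NPORTS] = {0}, cycles = 0, backlog;

    srand(1);
    for (long i = 0; i < iters; i++) {
        queued[P5]++;                        /* lea: ideally always p5     */
        queued[P1]++;                        /* popcnt: p1 only            */
        queued[rand() % 10 ? P0 : P5]++;     /* add: steals p5 1 in 10     */
        queued[P6]++;                        /* jnz: p6 only               */
        for (int p = 0; p < NPORTS; p++)
            if (queued[p]) queued[p]--;      /* one uop per port per cycle */
        cycles++;
    }
    do {                                     /* drain the p5 backlog       */
        backlog = 0;
        for (int p = 0; p < NPORTS; p++) {
            if (queued[p]) queued[p]--;
            backlog += queued[p];
        }
        cycles++;
    } while (backlog);
    printf("%.3f cycles/iteration\n", (double)cycles / iters);   /* ~1.100 */
}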

So how do modern x86 CPUs actually schedule? In particular:

  1. When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
  2. When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
  3. If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Results on Skylake

Let's measure some actual results on Skylake to check which answers explain the experimental evidence, so here are some real-world measured results (from perf) on my Skylake box. Confusingly, I'm going to switch to using imul for my "only executes on one port" instruction, since it has many variants, including 3-argument versions that allow you to use different registers for the source(s) and destination. This is very handy when trying to construct dependency chains. It also avoids the whole "incorrect dependency on destination" issue that popcnt has.

Let's start by looking at the simple (?) case where the instructions are relatively independent - without any dependency chains other than trivial ones like the loop counter.

Here's a 4 uop loop (only 3 executed uops) with mild pressure. All instructions are independent (they don't share any sources or destinations). The add could in principle steal the p1 needed by the imul or the p6 needed by the dec:

instr   p0 p1 p5 p6 
xor       (elim)
imul        X
add      X  X  X  X
dec               X

top:
    xor  r9, r9
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

The result is that this executes with perfect scheduling at 1.00 cycles/iteration:

   560,709,974      uops_dispatched_port_port_0                                     ( +-  0.38% )
 1,000,026,608      uops_dispatched_port_port_1                                     ( +-  0.00% )
   439,324,609      uops_dispatched_port_port_5                                     ( +-  0.49% )
 1,000,041,224      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,110      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,281,902      cycles:u                                                        ( +-  0.00% )

As expected, p1 and p6 are fully utilized by the imul and dec/jnz respectively, and then the add issues roughly half and half between the remaining available ports. Note roughly - the actual ratio is 56% and 44%, and this ratio is pretty stable across runs (note the +- 0.49% variation). If I adjust the loop alignment, the split changes (53/46 for 32B alignment, more like 57/42 for 32B+4 alignment). Now, if we change nothing except the position of imul in the loop:

top:
    imul rax, rbx, 5
    xor  r9, r9
    add  r8, rdx
    dec esi
    jnz top

Then suddenly the p0/p5 split is exactly 50%/50%, with 0.00% variation:

   500,025,758      uops_dispatched_port_port_0                                     ( +-  0.00% )
 1,000,044,901      uops_dispatched_port_port_1                                     ( +-  0.00% )
   500,038,070      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,066,733      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,439      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,439,396      cycles:u                                                        ( +-  0.01% )

So that's already interesting, but it's hard to tell what's going on. Perhaps the exact behavior depends on the initial conditions at loop entry and is sensitive to ordering within the loop (e.g., because counters are used). This example shows that something more than "random" or "stupid" scheduling is going on. In particular, if you just eliminate the imul instruction from the loop, you get the following:

   330,214,329      uops_dispatched_port_port_0                                     ( +-  0.40% )
   314,012,342      uops_dispatched_port_port_1                                     ( +-  1.77% )
   355,817,739      uops_dispatched_port_port_5                                     ( +-  1.21% )
 1,000,034,653      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,000,160      instructions:u            #    4.00  insns per cycle          ( +-  0.00% )
 1,000,235,522      cycles:u                                                      ( +-  0.00% )

Here, the add is now roughly evenly distributed among p0, p1 and p5 - so the presence of the imul did affect the add scheduling: it wasn't just a consequence of some "avoid port 1" rule.

Note here that total port pressure is only 3 uops/cycle, since the xor is a zeroing idiom and is eliminated in the renamer. Let's try with the max pressure of 4 uops. I expect whatever mechanism kicked in above to be able to schedule this perfectly as well. We only change xor r9, r9 to xor r9, r10, so it is no longer a zeroing idiom. We get the following results:

top:
    xor  r9, r10
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

       488,245,238      uops_dispatched_port_port_0                                     ( +-  0.50% )
     1,241,118,197      uops_dispatched_port_port_1                                     ( +-  0.03% )
     1,027,345,180      uops_dispatched_port_port_5                                     ( +-  0.28% )
     1,243,743,312      uops_dispatched_port_port_6                                     ( +-  0.04% )
     5,000,000,711      instructions:u            #    2.66  insns per cycle            ( +-  0.00% )
     1,880,606,080      cycles:u                                                        ( +-  0.08% )

Oops! Rather than evenly scheduling everything across p0156, the scheduler has underused p0 (it's only executing something ~49% of cycles), and hence p1 and p6 are oversubscribed because they are executing both their required uops of imul and dec/jnz. This behavior, I think, is consistent with a counter-based pressure indicator as hayesti indicated in their answer, and with uops being assigned to a port at issue time, not at execution time, as both hayesti and Peter Cordes mentioned. That behavior3 makes the "execute the oldest ready uop" rule much less effective. If uops weren't bound to execution ports at issue, but rather at execution, then this "oldest" rule would fix the problem above after one iteration - once one imul and one dec/jnz got held back for a single iteration, they would always be older than the competing xor and add instructions, so they would always get scheduled first. One thing I am learning, though, is that if ports are assigned at issue time, this rule doesn't help, because the ports are pre-determined at issue time. I guess it still helps a bit in favoring instructions which are part of long dependency chains (since these will tend to fall behind), but it's not the cure-all I thought it was.
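
To see how much the issue-time binding matters, here is a toy contrast of the two policies - entirely my own sketch, not a hardware model: 4 ports, one uop retired per port per cycle, dependencies ignored, and the fused dec/jnz treated as a single p6-only uop. Binding each flexible uop to a random eligible port at issue time costs about 1.5 cycles/iteration on this kernel, while choosing a free port at dispatch time with an oldest-first rule recovers 1.00 after exactly the one-iteration transient described above:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000L   /* iterations of the 4-uop xor/add/imul/dec-jnz kernel */

/* Policy A: bind each flexible uop to a random eligible port at issue. */
static long early_random(void) {
    long q[4] = {0}, cycles = 0, it = 0;   /* indices 0..3 = ports 0,1,5,6 */
    while (it < N || q[0] || q[1] || q[2] || q[3]) {
        if (it < N) {
            q[rand() & 3]++;               /* xor:  any of p0156 */
            q[rand() & 3]++;               /* add:  any of p0156 */
            q[1]++;                        /* imul: p1 only      */
            q[3]++;                        /* dec/jnz: p6 only   */
            it++;
        }
        for (int p = 0; p < 4; p++)        /* each port retires one uop */
            if (q[p]) q[p]--;
        cycles++;
    }
    return cycles;
}

/* age (seq number) of the k-th flexible uop: xor of iter i is 4i, add is 4i+1 */
static long flex_seq(long k) { return 4 * (k / 2) + (k & 1); }

/* Policy B: ports chosen at dispatch time, oldest ready uop first. */
static long late_oldest_first(void) {
    long flexIss = 0, flexDone = 0;        /* issued / dispatched so far */
    long imulIss = 0, imulDone = 0;
    long brIss = 0, brDone = 0;
    long cycles = 0, it = 0;

    while (it < N || flexDone < flexIss || imulDone < imulIss || brDone < brIss) {
        if (it < N) { flexIss += 2; imulIss++; brIss++; it++; }
        int freep[4] = {1, 1, 1, 1};
        for (;;) {
            long best = -1;
            int kind = -1;                 /* 0 = flexible, 1 = imul, 2 = branch */
            if (flexDone < flexIss && (freep[0] | freep[1] | freep[2] | freep[3])) {
                best = flex_seq(flexDone); kind = 0;
            }
            if (imulDone < imulIss && freep[1]) {
                long s = 4 * imulDone + 2;
                if (kind < 0 || s < best) { best = s; kind = 1; }
            }
            if (brDone < brIss && freep[3]) {
                long s = 4 * brDone + 3;
                if (kind < 0 || s < best) { best = s; kind = 2; }
            }
            if (kind < 0) break;           /* nothing dispatchable this cycle */
            if (kind == 0) {               /* flexible uop: take lowest free port */
                for (int p = 0; p < 4; p++)
                    if (freep[p]) { freep[p] = 0; break; }
                flexDone++;
            } else if (kind == 1) { freep[1] = 0; imulDone++; }
            else                  { freep[3] = 0; brDone++; }
        }
        cycles++;
    }
    return cycles;
}

int main(void) {
    srand(1);
    printf("bind at issue, random port : %.2f cycles/iter\n",
           (double)early_random() / N);      /* ~1.50 */
    printf("bind at dispatch, oldest   : %.2f cycles/iter\n",
           (double)late_oldest_first() / N); /* ~1.00 */
}

Interestingly, the measured 1.88 cycles/iteration is worse even than the random-binding toy model, which hints that the real issue-time heuristic is being misled by something further.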

That also seems to explain the results above: p0 gets assigned more pressure than it really receives because the dec/jnz combo can in theory execute on p06. In fact, because the branch is predicted taken it only ever goes to p6, but perhaps that information can't feed into the pressure-balancing algorithm, so the counters tend to see equal pressure on p016, meaning that the add and the xor get spread around differently than optimal.

Perhaps we can test this by unrolling the loop a bit so the jnz is less of a factor...

1 OK, it is properly written μops, but that kills searchability, and to actually type the "μ" character I usually resort to copy-pasting it from a webpage.

2 I had originally used imul instead of popcnt in the loop, but, unbelievably, IACA doesn't support it!

3 Please note that I'm not suggesting this is a poor design or anything - there are probably very good hardware reasons why the scheduler cannot easily make all its decisions at execution time.

Answer

Your questions are tough for a couple of reasons:

  1. The answer depends a lot on the microarchitecture of the processor, which can vary significantly from generation to generation.
  2. These are fine-grained details which Intel doesn't generally release to the public.

Nevertheless, I'll try to answer...

When multiple uops are ready in the reservation station, in what order are they scheduled to ports?

It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, 2 & 3) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as a place to start scanning for ready uops to dispatch. It was only pseudo FIFO so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station and I would hazard an educated guess that they used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.
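
For intuition, a collapsing queue can be sketched like this - my own toy rendering of the concept, not NetBurst's actual structure: entries are kept in age order, and removing a dispatched entry shifts everything younger down (collapses the hole), so scanning from slot 0 always yields the oldest ready uop:

#include <stdio.h>

#define QSIZE 16

struct uop { int id; int ready; };

static struct uop q[QSIZE];   /* age-ordered: slot 0 is always the oldest */
static int count;

static void insert(struct uop u) {        /* uops arrive in program order */
    if (count < QSIZE) q[count++] = u;
}

static int dispatch_oldest_ready(void) {  /* returns id, or -1 if none ready */
    for (int i = 0; i < count; i++) {
        if (q[i].ready) {
            int id = q[i].id;
            for (int j = i; j + 1 < count; j++)   /* collapse the hole so */
                q[j] = q[j + 1];                  /* age order is kept    */
            count--;
            return id;
        }
    }
    return -1;
}

int main(void) {
    insert((struct uop){1, 0});   /* oldest, but not ready yet */
    insert((struct uop){2, 1});
    insert((struct uop){3, 1});
    printf("%d\n", dispatch_oldest_ready());  /* 2: oldest *ready* uop */
    printf("%d\n", dispatch_oldest_ready());  /* 3 */
}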

When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?

That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end to the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station respectively.
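
In pseudo-C, the mechanism as I read it out of the patent looks roughly like this (the names and structure are mine, and the real hardware surely differs in detail):

#include <stdio.h>

#define NPORTS 8

static int rs_pending[NPORTS];      /* per-port count of uops waiting in the RS */

/* At issue: bind a uop to the eligible port with the fewest pending uops. */
static int assign_port(const int *eligible, int n) {
    int best = eligible[0];
    for (int i = 1; i < n; i++)
        if (rs_pending[eligible[i]] < rs_pending[best])
            best = eligible[i];
    rs_pending[best]++;             /* counter goes up as the uop enters the RS */
    return best;
}

/* At dispatch: the uop leaves the RS for its execution unit. */
static void dispatch(int port) {
    rs_pending[port]--;             /* counter goes down as the uop leaves */
}

int main(void) {
    int add_ports[] = {0, 1, 5, 6}; /* e.g. the add from the question: p0156 */
    int p = assign_port(add_ports, 4);
    printf("add bound to port %d\n", p);
    dispatch(p);
    return 0;
}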

Naturally this is just a heuristic and does not guarantee a perfect conflict-free schedule; however, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.

In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)

If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Since uops are inserted into the reservation station in order, oldest here does indeed refer to the time it entered the reservation station, i.e. oldest in program order.

By the way, I would take those IACA results with a grain of salt as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you how many cycles in your thread uops were issued to ports 0-7. Maybe you could leverage these to get a better understanding of your program?
