Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths


Question



I was playing with the code in this answer, slightly modifying it:

BITS 64

GLOBAL _start

SECTION .text

_start:
 mov ecx, 1000000

.loop:

 ;T is a symbol defined with the CLI (-DT=...)

 TIMES T imul eax, eax
 lfence
 TIMES T imul edx, edx


 dec ecx
 jnz .loop

 mov eax, 60           ;sys_exit
 xor edi, edi
 syscall
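
To build one variant, something like nasm -felf64 -DT=10 main.asm -o main-10.o followed by ld main-10.o -o main-10 should work (the file name and linking step are my assumptions; the -DT switch is the one mentioned in the comment above), producing the main-$T binaries measured below.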

Without the lfence, the results I get are consistent with the static analysis in that answer.

When I introduce a single lfence I'd expect the CPU to execute the imul edx, edx sequence of the k-th iteration in parallel with the imul eax, eax sequence of the next (k+1-th) iteration.
Something like this (calling A the imul eax, eax sequence and D the imul edx, edx one):

|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time

Taking more or less the same number of cycles, except for the unpaired A at the start and the unpaired D at the end.

When I measure the number of cycles for the original and the modified version with taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T, for T in the range below, I get:

T   Cycles:u    Cycles:u    Delta
    lfence      no lfence

10  42047564    30039060    12008504
15  58561018    45058832    13502186
20  75096403    60078056    15018347
25  91397069    75116661    16280408
30  108032041   90103844    17928197
35  124663013   105155678   19507335
40  140145764   120146110   19999654
45  156721111   135158434   21562677
50  172001996   150181473   21820523
55  191229173   165196260   26032913
60  221881438   180170249   41711189
65  250983063   195306576   55676487
70  281102683   210255704   70846979
75  312319626   225314892   87004734
80  339836648   240320162   99516486
85  372344426   255358484   116985942
90  401630332   270320076   131310256
95  431465386   285955731   145509655
100 460786274   305050719   155735555
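
To see the shape of these numbers, it helps to normalize them per iteration (a quick sketch; the 1,000,000 iteration count comes from the mov ecx, 1000000 in the listing above):

# Per-iteration view of selected rows from the table above.
ITERS = 1_000_000
rows = {  # T: (cycles:u with lfence, cycles:u without lfence)
     10: (42_047_564, 30_039_060),
     25: (91_397_069, 75_116_661),
     40: (140_145_764, 120_146_110),
     55: (191_229_173, 165_196_260),
     70: (281_102_683, 210_255_704),
     85: (372_344_426, 255_358_484),
    100: (460_786_274, 305_050_719),
}
for t, (lf, nolf) in rows.items():
    print(f"T={t:3}: no-lfence {nolf/ITERS:6.1f} c/iter (~3T),"
          f" lfence {lf/ITERS:6.1f} c/iter, delta {(lf-nolf)/ITERS:6.1f}")

The no-lfence column tracks 3*T cycles per iteration almost exactly, while the delta is anything but constant: it stays near 12-26 cycles up to T=55 and then grows steeply.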

How can the values of Cycles:u lfence be explained?
I would have expected them to be similar to those of Cycles:u no lfence since a single lfence should prevent only the first iteration from being executed in parallel for the two blocks.
I don't think it's due to the lfence overhead as I believe that should be constant for all Ts.

I'd like to fix what's wrong with my forma mentis when dealing with the static analysis of code.


Supporting repository with source files.

Solution

I'll present an analysis for the case where T = 1 for both codes (with and without lfence). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.

Because there is only a single easily predicted branch, the frontend will only stall if the backend stalls. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued per cycle from the IDQ (the instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entries of the scheduler. Each imul is decoded into a single uop that cannot be fused. The instructions dec ecx and jnz .loop get macrofused in the frontend into a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (one that is not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be dispatched separately to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch; see Footnote 2 in @Peter's answer.) lfence is decoded into 6 uops. Microfusion only matters in the backend, and in this case there is no microfusion in the loop.

Since the loop branch is easily predictable and since the number of iterations is relatively large, we can assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no microfusion, each uop will be dispatched as a single uop.

imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the imul uops is to dispatch them to port 1. In Haswell, Slow Int is nicely pipelined, so one imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication to become available to any instruction that requires it (the writeback stage is the third cycle after the dispatch stage of the pipeline). So for each dependency chain, at most one imul can be dispatched per 3 cycles.
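
As a small illustration (a sketch that encodes only the two facts just stated: 3-cycle latency, and port 1 starting at most one imul per cycle), the earliest possible dispatch cycles of the two chains interleave like this:

# Earliest-possible dispatch cycles for the imuls of the two chains,
# given a 3-cycle imul latency and one imul starting per cycle on port 1.
LATENCY = 3
port1_busy = set()          # cycles in which port 1 is already taken

def dispatch_cycles(n):
    cycles, ready = [], 0
    for _ in range(n):
        c = ready
        while c in port1_busy:      # port 1 conflict: wait a cycle
            c += 1
        port1_busy.add(c)
        cycles.append(c)
        ready = c + LATENCY         # next imul in this chain waits on the result
    return cycles

print("eax chain:", dispatch_cycles(5))   # [0, 3, 6, 9, 12]
print("edx chain:", dispatch_cycles(5))   # [1, 4, 7, 10, 13]

So port 1 executes imuls in two out of every three cycles, which is where the 1, 2, 2 dispatch pattern discussed below comes from.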

Because dec/jnz is predicted taken, the only execution unit that can execute it is the Primary Branch unit on port 6.

So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let's examine the loop without lfence:

imul eax, eax
imul edx, edx
dec ecx/jnz .loop (macrofused)

There are two possibilities:

  • Two imuls from the same iteration, one imul from a neighboring iteration, and one dec/jnz from one of those two iterations.
  • One dec/jnz from one iteration, two imuls from the next iteration, and the dec/jnz from that same next iteration.

So at the beginning of any cycle, the RS will receive at least one dec/jnz and at least one imul from each chain. At the same time, in the same cycle and from those uops that are already there in the RS, the scheduler will do one of two actions:

  • Dispatch the oldest dec/jnz to port 6 and dispatch the oldest imul that is ready to port 1. That's a total of 2 uops.
  • Because the Slow Int has a latency of 3 cycles but there are only two chains, in one out of every 3 cycles no imul in the RS will be ready for execution. However, there is always at least one dec/jnz in the RS, so the scheduler can dispatch that. That's a total of 1 uop.

Now we can calculate the expected number of uops in the RS, X_N, at the end of any given cycle N:

X_N = X_{N-1} + (the number of uops to be allocated into the RS at the beginning of cycle N) - (the expected number of uops that will be dispatched at the beginning of cycle N)
    = X_{N-1} + 4 - ((0+1)*1/3 + (1+1)*2/3)
    = X_{N-1} + 12/3 - 5/3
    = X_{N-1} + 7/3 for all N > 0

The initial condition for the recurrence is X_0 = 4. This is a simple recurrence that can be solved by unfolding X_{N-1}:

X_N = 4 + (7/3)*N ≈ 4 + 2.3*N for all N >= 0

The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:

60 = 4 + 7/3 * N
N = 56/2.3 = 24.3

So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
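
As a cross-check on this derivation, a minimal sketch of the fill phase (assuming the idealized model above: 4 uops allocated per cycle, and the repeating 1, 2, 2 dispatch pattern that follows from the two scenarios listed earlier) lands on the same cycle:

# RS occupancy during the fill phase, under the idealized model above.
RS_SIZE = 60
dispatch_pattern = [1, 2, 2]    # 5 uops dispatched every 3 cycles

occupancy = 4                   # X_0 = 4: first group allocated, none dispatched
cycle = 0
while occupancy < RS_SIZE:
    cycle += 1
    occupancy += 4 - dispatch_pattern[(cycle - 1) % 3]

print(f"RS holds {occupancy} uops at the end of cycle {cycle}")
# -> RS holds 60 uops at the end of cycle 24, matching the ~24.3 above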

The scheduler dispatches instructions at the following rates every cycle (as discussed above):

1, 2, 2, 1, 2, 2, 1, 2, 2, ...
But the allocator will not allocate any uops into the RS unless there are at least 4 available entries; otherwise, it would waste power issuing uops at sub-optimal throughput. However, it is only at the beginning of every 4th cycle that there are at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to stall 3 out of every 4 cycles.

Another important observation about the code being analyzed is that there are never more than 4 dispatchable uops, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired per cycle from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.

We can calculate the IPC (instructions per cycle) fairly easily now. The ROB entries look something like this:

imul eax, eax     -  N
imul edx, edx     -  N + 1
dec ecx/jnz .loop -  M
imul eax, eax     -  N + 3
imul edx, edx     -  N + 4
dec ecx/jnz .loop -  M + 1

The column on the right shows the cycle in which each instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain has the same path length, so the two constitute equal critical paths of 3 cycles each. So every 3 cycles, 4 instructions can be retired. The IPC is therefore 4/3 = 1.33 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in order, the retirement behavior will be the same.

We can check our analysis using both perf and IACA. I'll discuss perf. I have a Haswell CPU.

perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence

 Performance counter stats for './main-1-nolfence' (10 runs):

         30,01,556      cycles:u                                                      ( +-  0.00% )
         40,00,005      instructions:u            #    1.33  insns per cycle          ( +-  0.00% )
                 0      RESOURCE_STALLS.ROB                                         
         23,42,246      UOPS_ISSUED.ANY                                               ( +-  0.26% )
         22,49,892      RESOURCE_STALLS.RS                                            ( +-  0.00% )

       0.001061681 seconds time elapsed                                          ( +-  0.48% )

There are 1 million iterations, each taking about 3 cycles. Each iteration contains 4 instructions, and the IPC is 1.33. RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB; this of course never happens. UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and, with cmask=1,inv=1 as used here, the number of cycles in which the allocator was stalled (for whatever reason). The first is straightforward (not shown in the perf output): 1 million * 3 = 3 million + small noise. The latter is much more interesting: it shows that the allocator was stalled about 78% of all cycles, which matches our analysis. RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS; at about 75%, this is close to UOPS_ISSUED.ANY because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason; I'll have to see the results for T>1).
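
Plugging in the counters exactly as printed above (simple arithmetic, nothing more):

# Consistency check on the no-lfence perf output above.
cycles       = 3_001_556
issue_stalls = 2_342_246   # UOPS_ISSUED.ANY with cmask=1,inv=1: allocator stall cycles
rs_stalls    = 2_249_892   # RESOURCE_STALLS.RS: stall cycles due to a full RS

print(f"allocator stalled:  {issue_stalls / cycles:.0%} of cycles")   # ~78%
print(f"stalled on full RS: {rs_stalls / cycles:.0%} of cycles")      # ~75%

Both are close to the predicted 3 out of every 4 cycles.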

The analysis of the code without lfence can be extended to determine what happens if an lfence was added between the two imuls. Let's check out the perf results first (IACA unfortunately does not support lfence):

perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence

 Performance counter stats for './main-1-lfence' (10 runs):

       1,32,55,451      cycles:u                                                      ( +-  0.01% )
         50,00,007      instructions:u            #    0.38  insns per cycle          ( +-  0.00% )
                 0      RESOURCE_STALLS.ROB                                         
       1,03,84,640      UOPS_ISSUED.ANY                                               ( +-  0.04% )
                 0      RESOURCE_STALLS.RS                                          

       0.004163500 seconds time elapsed                                          ( +-  0.41% )

Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles by itself does not tell us much. The number of retired instructions has increased by a million, which is expected. We already know that the lfence will not make instructions complete any faster, so RESOURCE_STALLS.ROB should not change. UOPS_ISSUED.ANY and RESOURCE_STALLS.RS are particularly interesting. In this output, UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u) and has increased by 6 uops per iteration (no fusion). This means that an lfence placed between two imuls is decoded into 6 uops. The one-million-dollar question is now what these uops do and how they move around in the pipe.

RESOURCE_STALLS.RS is zero. What does that mean? It indicates that when the allocator sees an lfence in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an lfence until the lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will always be almost empty.

The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an lfence requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, suspends allocation from a given IDQ structure when it sees an lfence uop at the front of that structure, until the ROB is empty. So different uops are used with different hardware structures.

UOPS_ISSUED.ANY shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of lfence is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:

TIMES T lfence

The performance event counters will not work well for small values of T. For sufficiently large T, by measuring UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each lfence. That's because UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another lfence (it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or a few more cycles to retire, depending on the instruction. IACA always assumes that it takes 5 cycles to retire an instruction.

Our loop looks like this:

imul eax, eax
lfence
imul edx, edx
dec ecx
jnz .loop

In any cycle at an lfence boundary, the ROB will contain the following instructions, starting from the top of the ROB (the oldest instruction):

imul edx, edx     -  N
dec ecx/jnz .loop -  N
imul eax, eax     -  N+1

Where N denotes the cycle in which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is imul eax, eax, and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles N+1, N+2, N+3, and N+4. However, it will be about 5 more cycles until imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the lfence) for 10 out of these 13 cycles.
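
A back-of-the-envelope tally of one iteration, following the narrative above (a sketch only; the 5-cycle retire figure is the rough estimate given earlier, and the issue/cleanup term is inferred so the parts add up to the measured total):

# Rough per-iteration timeline for the T=1 lfence loop, per the analysis above.
stall_to_writeback = 4   # cycles N+1..N+4, until imul eax, eax writes back
stall_to_retire    = 5   # ~5 more cycles until imul eax, eax retires
issue_and_cleanup  = 4   # clear the lfence uops, allocate/dispatch next group (inferred)

stalled = stall_to_writeback + stall_to_retire    # ~9-10; perf says 10
total   = stalled + issue_and_cleanup             # ~13; perf says ~13
print(f"~{stalled} stalled cycles out of ~{total} cycles per iteration")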

The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.
