Why does the number of uops per iteration increase with the stride of streaming loads?

Problem Description

Consider the following loop:

.loop:
    add     rsi, OFFSET    
    mov     eax, dword [rsi]
    dec     ebp
    jg .loop

where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not initialized or touched before the loop. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages. So we can easily experiment with very large buffers.

The loop consists of 4 instructions. Each instruction is decoded into a single uop in both the fused and unfused domains on Haswell. There is also a loop-carried dependency between the successive instances of add rsi, OFFSET. Therefore, under idle conditions where the load always hits in the L1D, the loop should execute at about 1 cycle per iteration. For small offsets (strides), this is expected thanks to the IP-based L1 streaming prefetcher and the L2 streaming prefetcher. However, both prefetchers can only prefetch within a 4K page, and the maximum stride supported by the L1 prefetcher is 2K. So for small strides, there should be about 1 L1 miss per 4K page (for example, at a stride of 64 bytes that is roughly one miss every 64 iterations). As the stride increases, the total number of L1 misses and TLB misses will increase and performance will deteriorate accordingly.

The following graph shows various interesting performance counters (per iteration) for strides between 0 and 128. Note that the number of iterations is constant for all experiments. Only the buffer size changes to accommodate the specified stride. In addition, only user-mode performance events are counted.

The only weird thing here is that the number of retired uops increases with the stride. It goes from 3 uops per iteration (as expected) to 11 for stride 128. Why is that?

Things only get weirder with larger strides, as the following graph shows. In this graph, the strides range from 32 to 8192 in 32-byte increments. First, the number of retired instructions increases linearly from 4 to 5 up to a stride of 4096 bytes, after which it remains constant. The number of load uops increases from 1 to 3, while the number of L1D load hits remains 1 per iteration. Only the number of L1D load misses makes sense to me for all strides.

The two obvious effects of larger strides are:

  • The execution time increases and so more hardware interrupts will occur. However, I'm counting user-mode events, so interrupts should not interfere with my measurements (one way to restrict counting to user mode is sketched just after this list). I've also repeated all experiments with taskset or nice and got the same results.
  • The number of page walks and page faults increases. (I've verified this, but I'll omit the graphs for brevity.) Page faults are handled by the kernel in kernel mode. According to this answer, page walks are implemented using dedicated hardware (on Haswell?), although the link that the answer is based on is dead.
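
For reference, here is a minimal sketch of restricting counting to user mode with the perf_event_open system call, which is essentially what perf does for an event with the :u modifier. The event choice and error handling are illustrative only, not the exact harness used for the measurements above:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* Thin wrapper: glibc provides no perf_event_open() function, only the syscall number. */
static long perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                      int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                   /* count user mode only */
    attr.exclude_hv = 1;

    int fd = perf_open(&attr, 0 /* this thread */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the measured loop here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("user-mode instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}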

To investigate further, the following graph shows the number of uops from microcode assists. The number of microcode assist uops per iteration increases until it reaches the maximum value at stride 4096, just like with the other performance events. The number of microcode assist uops per 4K virtual page is 506 for all strides. The "Extra UOPS" line plots the number of retired uops minus 3 (the expected number of uops per iteration).

The graph shows that the number of extra uops is slightly larger than half of the number of microcode assist uops for all strides. I don't know what this means, but it could be related to page walks and could be the reason for the observed perturbation.

Why are the numbers of retired instructions and uops per iteration increasing for larger strides even though the number of static instructions per iteration is the same? Where is the interference coming from?


The following graphs plot the number of cycles per iteration against the number of retired uops per iteration for different strides. The number of cycles increases much more quickly than the number of retired uops. By using linear regression, I found:

cycles = 0.1773 * stride + 0.8521
uops = 0.0672 * stride + 2.9277

Taking the derivatives of both functions:

d(cycles)/d(stride) = 0.1773
d(uops)/d(stride) = 0.0672

This means that the number of cycles increases by 0.1773 and the number of retired uops increases by 0.0672 with each 1-byte increment in stride. If interrupts and page faults were indeed the (only) cause of perturbation, shouldn't both rates be very close?

Solution

The effect you see repeatedly across many of the performance counters, where the value increases linearly until stride 4096 after which it stays constant, makes total sense if you assume the effect is purely due to increasing page faults with increasing stride. Page faults affect the observed values because many counters are not exact¹ in the presence of interrupts, page faults and so on.

For example, take the instructions counter which ramps from 4 to 5 as you progress from stride 0 to 4096. We know from other sources that each page fault on Haswell will count one extra instruction in user mode (and one extra in kernel mode as well).

So the number of instructions we expect is the base of 4 instructions in the loop, plus some fraction of an instruction based on how many page faults we take per iteration. If we assume each new 4 KiB page causes a page fault, then the number of page faults per iteration is:

MIN(OFFSET / 4096, 1)

Since each page fault counts an extra instruction, we have then for the expected instruction count:

4 + 1 * MIN(OFFSET / 4096, 1)

which is in perfect agreement with your graph.
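
As a quick sanity check, here is a minimal sketch of that model in C, assuming (as above) a 4 KiB page size, the 4-instruction loop body, and exactly one extra counted instruction per page fault:

#include <stdio.h>

/* Page-fault over-counting model from the discussion above. */
static double faults_per_iter(double stride)
{
    double f = stride / 4096.0;   /* one new 4 KiB page every 4096/stride iterations */
    return f < 1.0 ? f : 1.0;     /* at stride >= 4096, one new page per iteration */
}

static double expected_instructions(double stride)
{
    return 4.0 + 1.0 * faults_per_iter(stride);  /* one extra counted instruction per fault */
}

int main(void)
{
    const double strides[] = {64, 1024, 2048, 4096, 8192};
    for (int i = 0; i < 5; i++)
        printf("stride %6.0f -> %.3f instructions/iteration\n",
               strides[i], expected_instructions(strides[i]));
    return 0;
}

It predicts 4.5 instructions per iteration at stride 2048 and a plateau of 5 from stride 4096 onward, matching the linear ramp described in the question.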

So the rough shape of the sloping graphs is explained for all the counters at once, with the slope depending only on the amount of over-counting per page fault. Then the only remaining question is why a page fault affects each counter in the way you determined. We've covered instructions already, but let's take a peek at the other ones:

MEM_LOAD_UOPS.L1_MISS

You get only 1 miss per page because only the load that touches the next page misses anything (it takes a fault). I don't actually agree that it is the L1 prefetcher that results in no other misses: I think you'd get the same result if you turned off the prefetchers. I think you get no more L1 misses because the same physical page backs every virtual page, and once you've added the TLB entry all lines are already in L1 (the very first iteration will miss, but I guess you are doing many iterations).

MEM_UOPS_RETIRED.ALL_LOADS

This shows 3 uops (2 extra) per page-fault.

I'm not 100% sure how this event works in the presence of uop replay. Does it always count a fixed number of uops based on the instruction, e.g., the number you'd see in Agner's instruction -> uop tables? Or does it count the actual number of uops dispatched on behalf of the instruction? This is usually the same, but loads replay their uops when they miss at various cache levels.

For example, I have found that on Haswell and Skylake² when a load misses in L1 but hits in L2, you see 2 uops total between the load ports (port2 and port3). Presumably what happens is that the uop is dispatched with the assumption it will hit in L1, and when this doesn't happen (the result is not ready when the scheduler expected it), it gets replayed with new timing anticipating an L2 hit. This is "lightweight" in that it doesn't require any kind of pipeline clear, as no wrong-path instructions have been executed.

Similarly for an L3 miss I have observed 3 uops per load.

Given that, it seems reasonable to assume the miss on the new page causes the load uop to be replayed twice (as I have observed), and those uops show up in the MEM_UOPS_RETIRED counter. One may reasonably argue that the replayed uops are not retired, but in some sense retirement is more associated with instructions than uops. Maybe this counter would be better described as "dispatched uops associated with retired load instructions".
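
Under that reading of the counter, the same page-fault model predicts the ramp of load uops from 1 to 3 per iteration seen in the question's second graph. A minimal sketch, assuming exactly 2 replays for the one load per iteration that touches a new (faulting) page:

#include <stdio.h>

/* Predicted MEM_UOPS_RETIRED.ALL_LOADS per iteration under the replay
 * hypothesis: 1 baseline load uop plus 2 replays per faulting load. */
static double predicted_load_uops(double stride)
{
    double faults = stride / 4096.0;
    if (faults > 1.0) faults = 1.0;
    return 1.0 + 2.0 * faults;    /* ramps from 1 to 3, plateauing at stride 4096 */
}

int main(void)
{
    printf("stride 2048: %.2f load uops/iteration\n", predicted_load_uops(2048));
    printf("stride 4096: %.2f load uops/iteration\n", predicted_load_uops(4096));
    return 0;
}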

UOPS_RETIRED.ALL and IDQ.MS_UOPS

The remaining weirdness is the large number of uops associated with each page. It seems entirely possible that this is associated with the page-fault machinery. You could try a similar test that misses in the TLB, but doesn't take the page-fault (make sure the pages are already populated, e.g., using mmap with MAP_POPULATE).
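
A minimal sketch of that suggested setup, assuming an anonymous mapping is an acceptable stand-in for the bss buffer; MAP_POPULATE pre-faults the pages, so a strided walk can still miss in the TLB without taking any page faults:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t size = (size_t)1 << 30;   /* 1 GiB; large enough for the strides tested */
    char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Strided read pass: every page is already populated, so these accesses
     * may miss the TLB (and caches) but take no page faults. */
    volatile char sink;
    size_t stride = 4096;            /* example stride; vary as in the experiments */
    for (size_t off = 0; off < size; off += stride)
        sink = buf[off];

    munmap(buf, size);
    return 0;
}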

The difference between MS_UOPS and UOPS_RETIRED doesn't seem that odd since some uops may not retire. Maybe they also count in different domains (I forget if UOPS_RETIRED is the fused or unfused domain).

Maybe there is also leakage between user and kernel mode counts in this case.

Cycles versus uop derivative

In the last part of your question, you show that the "slope" of cycles versus offset is about 2.6x larger than the slope of retired uops versus offset (0.1773 / 0.0672 ≈ 2.6).

As above, the effect here stops at 4096 and we expect again this effect is entirely due to page-faults. So the difference in slope just means that a page fault costs 2.6x more cycles than it does uops.

You say:

If interrupts and page faults were indeed the (only) cause of perturbation, shouldn't both rates be very close?

I don't see why. The relationship between uops and cycles can vary widely, by perhaps three orders of magnitude: the CPU might execute four uops per cycle, or it might take hundreds of cycles to execute a single uop (such as a cache-missing load).

The value of 2.6 cycles per uop is right in the middle of this big range and doesn't strike me as odd: it is a bit high ("inefficient" if you were talking about optimized application code) but here we are talking about page fault handling which is a totally different thing, so we expect long delays.

Studies into over-counting

Anyone interested in over-counting due to page faults and other events might be interested in this github repository, which has exhaustive tests for the "determinism" of various PMU events, and where many results of this nature have been noted, including on Haswell. It doesn't, however, cover all the counters Hadi mentions here (otherwise we'd already have our answer). Here's the associated paper and some easier-to-consume associated slides - they mention in particular that one extra instruction is incurred per page fault.

Here's a quote for the results from Intel:

Conclusions on the event determinism:
1.  BR_INST_RETIRED.ALL (0x04C4)
a.  Near branch (no code segment change): Vince tested 
    BR_INST_RETIRED.CONDITIONAL and concluded it as deterministic. 
    We verified that this applies to the near branch event by using 
    BR_INST_RETIRED.ALL - BR_INST_RETIRED.FAR_BRANCHES.
b.  Far branch (with code segment change): BR_INST_RETIRED.FAR_BRANCHES 
    counts interrupts and page-faults. In particular, for all ring 
    (OS and user) levels the event counts 2 for each interrupt or 
    page-fault, which occurs on interrupt/fault entry and exit (IRET).
    For Ring 3 (user) level,  the counter counts 1 for the interrupt/fault
    exit. Subtracting the interrupts and faults (PerfMon event 0x01cb and
    Linux Perf event - faults), BR_INST_RETIRED.FAR_BRANCHES remains a 
    constant of 2 for all the 17 tests by Perf (the 2 count appears coming
    from the Linux Perf for counter enabling and disabling). 
Consequently, BR_INST_RETIRED.FAR_BRANCHES is deterministic. 

So you expect one extra instruction (in particular, a branch instruction) per page fault.


¹ In many cases this "inexactness" is still deterministic - in that the over- or under-counting always behaves in the same way in the presence of the external event, so you may be able to correct for it if you also track how many of the relevant events have happened.

² I don't mean to limit it to those two micro-architectures: they just happen to be the ones I've tested.
