Why are the user-mode L1 store miss events only counted when there is a store initialization loop?


Question

Summary

Consider the following loop:

loop:
movl   $0x1,(%rax)
add    $0x40,%rax
cmp    %rdx,%rax
jne    loop

where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to the next cache line. I expect the number of RFO requests sent from the L1D to the L2 to be more or less equal to the number of cache lines accessed. The problem is that this seems to be the case only when I count kernel-mode events, even though the program runs in user mode, except in one case as discussed below. The way the buffer is allocated does not seem to matter (.bss, .data, or from the heap).

Details

The results of my experiments are shown in the tables below. All of the experiments are performed on processors with hyperthreading disabled and all hardware prefetchers enabled.

I've tested the following three cases:

  • There is no initialization loop. That is, the buffer is not accessed before the "main" loop shown above. I'll refer to this case as NoInit. There is only one loop in this case.
  • The buffer is first accessed using one load instruction per cache line. Once all the lines are touched, the main loop is then executed. I'll refer to this case as LoadInit. There are two loops in this case.
  • The buffer is first accessed using one store instruction per cache line. Once all the lines are touched, the main loop is then executed. I'll refer to this case as StoreInit. There are two loops in this case.
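The two initialization variants could be sketched as follows (a hedged reconstruction; the function names are mine, not from the original code):

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_SIZE 64

/* LoadInit: touch every line with one load before the main loop. */
static uint64_t load_init(const uint8_t *buf, size_t size) {
    uint64_t sum = 0;
    for (size_t i = 0; i < size; i += LINE_SIZE)
        sum += buf[i];      /* one load per cache line; sum keeps the
                               loads from being optimized away */
    return sum;
}

/* StoreInit: touch every line with one store before the main loop. */
static void store_init(uint8_t *buf, size_t size) {
    for (size_t i = 0; i < size; i += LINE_SIZE)
        buf[i] = 0;         /* one store per cache line */
}
```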

The following table shows the results on an Intel CFL processor. These experiments have been performed on Linux kernel version 4.4.0.

The following table shows the results on an Intel HSW processor. Note that the events L2_RQSTS.PF_HIT, L2_RQSTS.PF_MISS, and OFFCORE_REQUESTS.ALL_REQUESTS are not documented for HSW. These experiments have been performed on Linux kernel version 4.15.

The first column of each table contains the names of the performance monitoring events whose counts are shown in the other columns. In the column labels, the letters U and K represent user-mode and kernel-mode events, respectively. For the cases that have two loops, the numbers 1 and 2 are used to refer to the initialization loop and the main loop, respectively. For example, LoadInit-1K represents the kernel-mode counts for the initialization loop of the LoadInit case.

The values shown in the tables are normalized by the number of cache lines. They are also color-coded: the darker the green, the larger the value relative to all other cells in the same table. However, the last three rows of the CFL table and the last two rows of the HSW table are not color-coded because some of the values in these rows are too large. These rows are painted in dark gray to indicate that they are not color-coded like the other rows.

I expect that the number of user-mode L2_RQSTS.ALL_RFO events to be equal to the number of cache lines accessed (i.e., a normalized value of 1). This event is described in the manual as follows:

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

It says that L2_RQSTS.ALL_RFO may count not only demand RFO requests from the L1D but also L1D RFO prefetches. However, I've observed that the event count is not affected by whether the L1D prefetchers are enabled or disabled on either processor. But even if the L1D prefetchers may generate RFO prefetches, the event count should then be at least as large as the number of cache lines accessed. As can be seen from both tables, this is only the case in StoreInit-2U. The same observation applies to all of the events shown in the tables.

However, the kernel-mode counts of the events are about equal to what the user-mode counts are expected to be. This is in contrast to, for example, MEM_INST_RETIRED.ALL_STORES (or MEM_UOPS_RETIRED.ALL_STORES on HSW), which works as expected.

Due to the limited number of PMU counter registers, I had to divide all the experiments into four parts. In particular, the kernel-mode counts are produced from different runs than the user-mode counts. Which events are counted together in the same run doesn't really matter. I think it's important to mention this because it explains why some user-mode counts are a little larger than the kernel-mode counts of the same events.

The events shown in dark gray seem to overcount. The 4th-gen and 8th-gen Intel processor specification updates do mention (errata HSD61 and 111, respectively) that OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO may overcount. But these results indicate that it may overcount by many times, not by just a couple of events.

There are other interesting observations, but they are not pertinent to the question, which is: why are the RFO counts not as expected?

Solution

You didn't flag your OS, but let's assume you are using Linux. This stuff would be different on another OS (and perhaps even within various variants of the same OS).

On a read access to an unmapped page, the kernel page fault handler maps in a system-wide shared zero page, with read-only permissions.

This explains columns LoadInit-1U|K: even though your init load is striding over a virtual area of 64 MB performing loads, only a single physical 4K page filled with zeros is mapped, so you get approximately zero cache misses after the first 4KB, which rounds to zero after your normalization.1

On a write access to an unmapped page, or to the read-only shared zero page, the kernel will map a new unique page on behalf of the process. This new page is guaranteed to be zeroed, so unless the kernel has some known-to-be-zero pages hanging around, this involves zeroing the page (effectively memset(new_page, 0, 4096)) prior to mapping it.

That largely explains the remaining columns except for StoreInit-2U|K. In those cases, even though it seems like the user program is doing all the stores, the kernel ends up doing all of the hard work (except for one store per page) since as the user process faults in each page, the kernel writes zeros to it, which has the side effect of bringing all the pages into the L1 cache. When the fault handler returns, the triggering store and all subsequent stores for that page will hit in the L1 cache.

It still doesn't fully explain StoreInit-2. As clarified in the comments, the K column actually includes the user counts, which explains that column (subtracting out the user counts leaves it at roughly zero for every event, as expected). The remaining confusion is why L2_RQSTS.ALL_RFO is not 1 but some smaller value like 0.53 or 0.68. Maybe the event is undercounting, or there is some micro-architectural effect that we're missing, like a type of prefetch that prevents the RFO (for example, if the line is loaded into the L1 by some type of load operation before the store, the RFO won't occur). You could try to include the other L2_RQSTS events to see if the missing events show up there.

Variations

It doesn't need to be like that on all systems. Certainly other OSes may have different strategies, but even Linux on x86 might behave differently based on various factors.

For example, rather than the 4K zero page, you might get allocated a 2 MiB huge zero page. That would change the benchmark since 2 MiB doesn't fit in L1, so the LoadInit tests will probably show misses in user-space on the first and second loops.

More generally, if you were using huge pages, the page fault granularity would be changed from 4 KiB to 2 MiB, meaning that only a small part of the zeroed page would remain in L1 and L2, so you'd get L1 and L2 misses, as you expected. If your kernel ever implements fault-around for anonymous mappings (or whatever mapping you are using), it could have a similar effect.
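One way to experiment with this variation is to hint the kernel to use transparent huge pages for the mapping. This is only a sketch: madvise(MADV_HUGEPAGE) is advisory, and whether a 2 MiB page is actually used depends on kernel configuration and alignment.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14   /* Linux value, in case the header lacks it */
#endif

/* Map an anonymous region and hint the kernel to back it with
 * transparent huge pages, which would change the write-fault (and
 * zeroing) granularity from 4 KiB to 2 MiB where granted. */
static void *map_thp(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, size, MADV_HUGEPAGE);  /* advisory; may be ignored */
    return p;
}
```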

Another possibility is that the kernel may zero pages in the background and so have zero pages ready. This would remove the K counts from the tests, since the zeroing doesn't happen during the page fault, and would probably add the expected misses to the user counts. I'm not sure if the Linux kernel ever did this or has the option to do it, but there were patches floating around. Other OSes like BSD have done it.

RFO Prefetchers

About "RFO prefetchers" - the RFO prefetchers are not really prefetchers in the usual sense, and they are unrelated to the L1D prefetchers that can be turned off. As far as I know, "RFO prefetching" from the L1D simply refers to sending an RFO request for stores in the store buffer which are reaching the head of the store buffer. Obviously when a store gets to the head of the buffer, it's time to send an RFO, and you wouldn't call that a prefetch - but why not send some requests for the second-from-the-head store too, and so on? Those are the RFO prefetches, but they differ from a normal prefetch in that the core knows the address that has been requested: it is not a guess.

There is speculation in the sense that getting additional lines other than the current head may be wasted work if another core sends an RFO for that line before this core has a chance to write to it: the request was useless in that case and just increased coherency traffic. So there are predictors that may reduce this store buffer prefetch if it fails too often. There may also be speculation in the sense that store buffer prefetch may send requests for junior stores which haven't retired, at the cost of a useless request if the store ends up being on a bad path. I'm not actually sure if current implementations do that.


1 This behavior actually depends on the details of the L1 cache: current Intel VIPT implementations allow multiple virtual aliases of the same single line to all live happily in L1. Current AMD Zen implementations use a different implementation (micro-tags) which don't allow the L1 to logically contain multiple virtual aliases, so I would expect Zen to miss to L2 in this case.
