Why does using MFENCE with store instruction block prefetching in L1 cache?

Question

I have an object of 64 bytes in size:

typedef struct _object{
  int value;
  char pad[60];
} object;

In main I am initializing an array of objects:

volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));

for(int i=0; i < arr_size; i++){
    array[i].value = 1;
    _mm_clflush((const void*)&array[i]);
}
_mm_mfence();

Then I loop through each element again. This is the loop I am counting events for:

int tmp;
for(int i=0; i < arr_size-105; i++){
    array[i].value = 2;
    //tmp = array[i].value;
    _mm_mfence();
}

Having mfence does not make any sense here, but I was trying something else and accidentally found that if I do the store operation without mfence, I get half a million RFO requests (measured by the papi L2_RQSTS.ALL_RFO event), which means that the other half million were L1 hits, prefetched before the demand access. However, including mfence results in 1 million RFO requests, giving RFO_HITs, which means that the cache lines are only prefetched into L2, not into L1 anymore.

This is despite the fact that the Intel documentation somehow indicates otherwise: "data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction." I checked with load operations: without mfence I get up to 2000 L1 hits, whereas with mfence I get up to 1 million L1 hits (measured with the papi MEM_LOAD_RETIRED.L1_HIT event). So the cache lines are still prefetched into L1 for load instructions.

So it should not be the case that including mfence blocks prefetching. Both the store and load versions take almost the same time: without mfence 5-6 msec, with mfence 20 msec. I went through other questions regarding mfence, but none of them mention its expected behavior with prefetching, and I don't see a good enough reason or explanation for why it would block prefetching in the L1 cache with store operations only. Or might I be missing something in the mfence description?

I am testing on the Skylake microarchitecture; however, I also checked on Broadwell and got the same result.

Answer

It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:

wrmsr -a 0x1a4 "$((2#1110))"
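For reference, here is the usual per-core layout of MSR 0x1A4 (from Intel's disclosure on hardware prefetcher control; a set bit disables the corresponding prefetcher, so double-check the mapping for your part):

# MSR 0x1A4, one bit per prefetcher (1 = disabled):
#   bit 0: L2 hardware prefetcher ("streamer")
#   bit 1: L2 adjacent-cache-line prefetcher
#   bit 2: DCU (L1D) next-line prefetcher
#   bit 3: DCU instruction-pointer (stride) prefetcher
wrmsr -a 0x1a4 "$((2#1110))"   # everything off except the L2 streamer
wrmsr -a 0x1a4 "$((2#1111))"   # the L2 streamer off as well
rdmsr -a 0x1a4                 # read back to verify (msr-tools)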

If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 L2.RFO_MISS and L2.RFO_ALL even without the mfence.

First, it is important to note that the L2_RQSTS.RFO_* events do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umask for each of the 0x24 RFO events is:

name      umask
RFO_MISS   0x22
RFO_HIT    0x42
ALL_RFO    0xE2

Note that none of the umask values have the 0x10 bit set, which indicates that events originating from the L2 streamer should be tracked.

It seems that when the L2 streamer is active, many of the events that you might expect to be assigned to one of those counters are instead "eaten" by the L2 prefetcher events. What likely happens is that the L2 prefetcher is running ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This only increments the umask |= 0x10 version of the event (indeed, I get 2,000,000 total references when including that bit), which means that RFO_MISS, RFO_HIT and ALL_RFO will all miss it.

It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly, but instead hit an in-progress load - the complication here being that the load was initiated by the L2 prefetcher.

The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.
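As a side note, the overall slowdown the question reports (5-6 msec vs 20 msec) can be checked with a plain wall-clock harness; a minimal sketch, with the structure being my own rather than from the question:

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... store loop under test, with or without _mm_mfence() ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    // Wall-clock milliseconds between the two CLOCK_MONOTONIC samples.
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%.2f ms\n", ms);
    return 0;
}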

I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.

Here are some useful perf commands you can use to see the difference when including the "L2 streamer origin" bit. Here's without the L2 streamer events:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/

and here's with them included:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/
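Note that the second set of umasks is just the first set with the 0x10 bit ORed in: 0xef becomes 0xff, 0xe2 becomes 0xf2, 0xc2 becomes 0xd2, and 0x22 becomes 0x32.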

I ran these against this code (with the sleep(1) lining up with the --delay=1000 argument passed to perf, to exclude the init code):

#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>   // for malloc
#include <unistd.h>

typedef struct _object{
  int value;
  char pad[60];
} object;

int main() {
    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));

    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush((const void*)&array[i]);
    }
    _mm_mfence();

    sleep(1);
    // printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);

    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i].value;
        // _mm_mfence();
    }
}
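To reproduce, a build and run along these lines should work (the file name, binary name, and compiler flags are my assumptions; the volatile qualifier keeps the store loop from being optimized away at -O2):

gcc -O2 -o mfence_rfo mfence_rfo.c
perf stat --delay=1000 -e cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/ ./mfence_rfo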
