Is it possible to know the address of a cache miss?


Question

Whenever a cache miss occurs, is it possible to know the address of the cache line that missed? Are there any hardware performance counters in modern processors that can provide such information?

Recommended answer

Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but the data address as well. These events also include a great deal of other information, such as the level of the cache hierarchy in which the memory access was satisfied, the total latency, and so on.

You can use perf mem to sample this information and produce a report.

For example, the following program:

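#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

// The non-zero initializer places p in the initialized .data section,
// so perf can report accesses symbolically as p+offset.
int p[SIZE] = {1};

// Store to every fifth element; volatile keeps the compiler from
// optimizing the accesses away.
void do_writes(volatile int *p) {
    for (size_t i = 0; i < SIZE; i += 5) {
        p[i] = 42;
    }
}

// Load every fifth element into a volatile sink so the loads are kept.
void do_reads(volatile int *p) {
    volatile int sink;
    for (size_t i = 0; i < SIZE; i += 5) {
        sink = p[i];
    }
}

int main(int argc, char **argv) {
    do_writes(p);
    do_reads(p);
}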

compiled with:

g++  -g -O1 -march=native   perf-mem-test.cpp   -o perf-mem-test

and run with:

sudo perf mem record -U ./perf-mem-test && sudo perf mem report

produces a report of memory accesses sorted by latency.

In that report, the Data Symbol column shows which address each load was targeting. Most entries show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 bytes from the start of p. That makes sense, as the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles [1].
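To make the symbol+offset notation concrete, here is a minimal sketch that turns a byte offset such as 0xa0658b4 into an element index and a cache-line number. The names locate, ElementLocation and kCacheLineBytes are made up for this illustration, and the 64-byte line size is an assumption (typical on current x86, but not guaranteed).

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Check your CPU before relying on this.
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical helper type: where a byte offset lands inside an array.
struct ElementLocation {
    std::size_t index;       // element index within the array
    std::size_t cache_line;  // line number counted from the array start,
                             // assuming the array begins on a line boundary
};

// Converts the byte offset from a `symbol+offset` entry into an element
// index (offset / element size) and a cache-line number (offset / line size).
constexpr ElementLocation locate(std::uint64_t byte_offset,
                                 std::size_t element_size) {
    return {static_cast<std::size_t>(byte_offset / element_size),
            static_cast<std::size_t>(byte_offset / kCacheLineBytes)};
}
```

With 4-byte int elements, locate(0xa0658b4, sizeof(int)) gives element index 42047021 of p, in cache line 2627938 relative to the start of the array.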

Note that the information recorded is only a sample of memory accesses: recording every miss would usually be far too much information. Furthermore, by default it only records loads with a latency of 30 cycles or more, but you can apparently adjust this threshold with command line arguments (the --ldlat option of perf mem record).

If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines [2]. Perhaps you can restrict your sampling to only cache misses: I'm pretty sure the Intel memory sampling facility supports that, and I think you can tell perf mem to look at only misses.

Finally, note that here I'm using the -U argument after record, which instructs perf mem to record only userspace events. By default it will include kernel events, which may or may not be useful for you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.

Keep in mind that I specifically arranged my program such that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map this back to a meaningful object depends on whether you track enough information to make that possible.
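One way to "track enough information" is to record the address range of each allocation you care about and look sampled addresses up against those ranges. The following is only a sketch of that idea, not something perf provides: AllocationRegistry and its methods are hypothetical names, and it assumes the registered ranges do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not part of perf): remembers which address ranges
// belong to which named objects, so raw sample addresses can be labeled.
class AllocationRegistry {
    struct Range {
        std::size_t size;
        std::string name;
    };
    std::map<std::uint64_t, Range> ranges_;  // keyed by base address

public:
    struct Match {
        std::string name;      // label given when the range was registered
        std::uint64_t offset;  // byte offset of the address within the range
    };

    // Call after an allocation, e.g. add((uint64_t)ptr, bytes, "my_table").
    void add(std::uint64_t base, std::size_t size, std::string name) {
        assert(size > 0);
        ranges_[base] = Range{size, std::move(name)};
    }

    // Call before the matching deallocation.
    void remove(std::uint64_t base) { ranges_.erase(base); }

    // Returns true and fills `out` if `addr` falls inside a registered range.
    bool resolve(std::uint64_t addr, Match& out) const {
        auto it = ranges_.upper_bound(addr);  // first range starting after addr
        if (it == ranges_.begin()) return false;
        --it;                                 // last range starting at or before addr
        const std::uint64_t offset = addr - it->first;
        if (offset >= it->second.size) return false;
        out = Match{it->second.name, offset};
        return true;
    }
};
```

A raw data address taken from a sample could then be passed to resolve() to recover a name+offset pair similar to what the report shows for globals. Stack memory and allocations made by libraries would need their own bookkeeping.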

[1] I think it's in reference cycles, but I could be wrong and the kernel may have already converted it to nanoseconds.

[2] The "Local" and "hit" parts here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have to go to the RAM associated with another socket in a multi-socket NUMA configuration.
