Can perf account for all cache misses?


Question

I'm trying to understand the cache misses recorded by perf. I have a minimal program:

int main(void)
{
    return 0;
}

If I compile this with:

gcc -std=c99 -W -Wall -Werror -O3 -S -o test.S test.c

I get an expectedly small program:

        .file   "test.c"
        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Debian 4.7.2-5) 4.7.2"
        .section        .note.GNU-stack,"",@progbits

With only two instructions, xorl and ret, the program should be less than a cache line in size, so I would expect that if I run perf stat -e "cache-misses:u" ./test I should see only a single cache miss. However, I instead see between 2 and ~400. Similarly, perf stat -e "cache-misses" ./test results in ~700 to ~2500.

Is this simply a case of perf estimating counts or is there something about the way cache misses occur that makes reasoning about them approximate? For example, if I generate and then read an array of integers in memory, can I reason about the prefetching (sequential access should allow for perfect prefetching) or is there something else at play?
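
For concreteness, a minimal sketch of the experiment I have in mind (the sizes and names are just illustrative):

/* Build:   gcc -O2 -o array_test array_test.c
   Measure: perf stat -e cache-misses:u ./array_test */
#include <stdlib.h>

#define N (1 << 24)   /* 16M ints = 64 MiB, much larger than the caches */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    if (!a)
        return 1;

    for (int i = 0; i < N; i++)   /* sequential writes: prefetch-friendly */
        a[i] = i;

    long sum = 0;
    for (int i = 0; i < N; i++)   /* sequential reads: prefetch-friendly */
        sum += a[i];

    free(a);
    return sum == 0;   /* keep sum live so the read loop isn't elided */
}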

Answer

You created a main instead of _start, and probably built it into a dynamically-linked executable! So there's all the CRT startup code, initializing libc, and several system calls. Run strace ./test and see how many system calls it's making. (And of course there's lots of work in user-space that doesn't involve system calls.)
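
For example, a per-syscall summary (illustrative; the exact set of calls depends on your libc and dynamic linker):

strace -c ./test     # -c prints a table of syscall counts instead of a trace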

What would be more interesting is a statically linked executable that just makes an _exit(0) or exit_group(0) system call with the syscall instruction, from the _start entry point.

Given an exit.s containing:

mov $231, %eax        # __NR_exit_group on x86-64
syscall

build it into a static executable so these two instructions are the only ones executed in user-space:

$ gcc -static -nostdlib exit.s
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
  # the default is fine, our instructions are at the start of the .text section

$ perf stat -e cache-misses:u ./a.out 

 Performance counter stats for './a.out':

                 6      cache-misses:u                                              

       0.000345362 seconds time elapsed

       0.000382000 seconds user
       0.000000000 seconds sys

I told it to count cache-misses:u to only measure user-space cache misses, instead of everything on the core the process was running on. (That would include kernel cache misses before entering user-space and while handling the exit_group() system call. And potentially interrupt handlers).
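
As an aside, perf's event modifiers let you count the same event at each privilege level separately; a quick illustration:

perf stat -e cache-misses:u,cache-misses:k,cache-misses ./a.out

(:u counts user-space only, :k kernel only, and the unmodified event counts both.)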

(There is hardware support in the PMU for events to count when the privilege level is user, kernel, or both, so we should expect counts to be off by at most 1 or 2 from counting stuff done during the transition from kernel->user or user->kernel: changing CS potentially results in a load from the GDT of the segment descriptor indexed by the new CS value.)

How does Linux perf calculate the cache-references and cache-misses events explains:

perf apparently maps cache-misses to a HW event that counts last-level cache misses. So it's something like the number of DRAM accesses.

Multiple attempts to access the same line in L1d or L1i cache while an L1 miss is already outstanding just add another thing waiting for the same incoming cache line. So it's not counting loads (or code-fetch) that have to wait for cache. Multiple loads can coalesce into one access.
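
A hedged sketch of that coalescing (assuming 64-byte lines; hardware prefetch also affects the absolute counts, so treat it as illustrative only):

/* Stride 1 touches every int in each line (16 loads per line); stride 16
   touches one int per line. Both read loops fetch every line exactly once,
   so cache-misses:u should come out similar for the two, not 16x apart.
   Run: perf stat -e cache-misses:u ./coalesce 1   (then ./coalesce 16) */
#include <stdlib.h>

#define N (1 << 24)

int main(int argc, char **argv)
{
    int stride = (argc > 1) ? atoi(argv[1]) : 1;
    int *a = malloc(N * sizeof *a);
    if (!a)
        return 1;
    for (int i = 0; i < N; i++)   /* touch the pages so they're really backed */
        a[i] = i;
    long sum = 0;
    for (int i = 0; i < N; i += stride)
        sum += a[i];
    free(a);
    return sum == 0;              /* keep sum live */
}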

But also remember that code-fetch needs to go through the iTLB, triggering a page-walk. Page-walk loads are cached, i.e. they're fetched through the cache hierarchy. So they're counted by the cache-misses event if they do miss.

Repeated runs of the program can result in 0 cache-miss events. The executable binary is a file, and the file is cached (OS's disk cache) by the pagecache. That physical memory is mapped into the address-space of the process running it. It can certainly stay hot in L3 across process start/stop. More interesting is that apparently the page-table stays hot, too. (Not literally "stays" hot; I assume the kernel has to write a new one every time. But presumably the page-walker is hitting at least in L3 cache.)

Or at least whatever else was causing the "extra" cache-miss events doesn't have to happen.

I used perf stat -r16 to run it 16 times and show the mean ± stddev:

$ perf stat -e instructions:u,L1-dcache-loads:u,L1-dcache-load-misses:u,cache-misses:u,itlb_misses.walk_completed:u -r 16 ./exit

 Performance counter stats for './exit' (16 runs):

                 3      instructions:u                                              
                 1      L1-dcache-loads                                             
                 5      L1-dcache-load-misses     #  506.25% of all L1-dcache hits    ( +-  6.37% )
                 1      cache-misses:u                                                ( +-100.00% )
                 2      itlb_misses.walk_completed:u                                   

         0.0001422 +- 0.0000108 seconds time elapsed  ( +-  7.57% )

Note the +-100% on cache-misses.

I don't know why we have 2 itlb_misses.walk_completed events, not just 1. Counting itlb_misses.miss_causes_a_walk:u instead gives us 4 consistently.
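
That is, the same invocation with the event name swapped, e.g.:

perf stat -e itlb_misses.miss_causes_a_walk:u -r 16 ./exit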

Reducing to -r 1 and running repeatedly with manual up-arrow, cache-misses bounces around between 3 and 13. The system is mostly idle but with a bit of background network traffic.

I also don't know why anything is showing as an L1D load, or how there can be 6 misses from one load. But Hadi's answer says that perf's L1-dcache-load-misses event actually counts L1D.REPLACEMENT, so the page-walks could account for that. While L1-dcache-loads counts MEM_INST_RETIRED.ALL_LOADS. mov-immediate isn't a load, and I wouldn't have thought syscall is either. But maybe it is, otherwise the HW is falsely counting a kernel instruction or there's an off-by-1 somewhere.
