How does a system-wide profiler (e.g. perf) correlate counters with instructions?

Problem description

I'm trying to understand how a system-wide profiler works. Let's take Linux perf as an example. For a certain profiling time it can provide:

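  • various aggregated hardware performance counters
  • the time spent and the hardware counters (e.g. #instructions) for each user-space process and kernel-space function
  • information about context switches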

The first thing I'm almost sure about is that the report is just an estimate of what's really happening. So I think there's some kernel module that launches software interrupts at a certain sampling rate. The lower the sampling rate, the lower the profiler overhead. The interrupt can read the model-specific registers that store the performance counters.

The next part is to correlate the counters with the software that's running on the machine. That's the part I don't understand.

  1. So where does the profiler get its data from?

Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it? Won't that affect the execution of the scheduler (e.g. instead of continuing the interrupted function it will just schedule another one, making the profiler result inaccurate)? Is the list of task_struct objects available?

Recommended answer

So I think there's some kernel module that launches software interrupts at a certain sampling rate.

Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c and, for every supported architecture and CPU model, in files such as arch/x86/kernel/cpu/perf_event*.c. Oprofile, by contrast, was a module that took a similar approach.

Perf generally works by asking the CPU's PMU (performance monitoring unit) to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5 "• Interrupt when threshold reached: allows sampling"). In practice it may be implemented as:

  • select some PMU counter
  • initialize it to -N, where N is the sampling period (we want an interrupt after N events, for example after 2 million cycles with perf record -c 2000000 -e cycles, or after some N computed and tuned by perf when no extra option is set or -F is given)
  • set this counter to the wanted event, and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). The overflow happens after N increments of our counter.
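
The three steps above can be sketched as a small user-space simulation. This is a toy model rather than kernel code: the 48-bit counter width is an assumption (widths differ between CPU models), and preload and count_interrupts are hypothetical helper names.

```python
# Toy model of overflow-based sampling. Assumption: a 48-bit counter;
# real PMU counter widths depend on the CPU model.
CNTVAL_BITS = 48
CNTVAL_MASK = (1 << CNTVAL_BITS) - 1


def preload(period):
    """Raw counter value that wraps to zero after `period` increments."""
    return (-period) & CNTVAL_MASK


def count_interrupts(period, total_events):
    """Feed `total_events` events to the counter and count the overflows."""
    counter = preload(period)
    interrupts = 0
    for _ in range(total_events):
        counter = (counter + 1) & CNTVAL_MASK
        if counter == 0:               # the counter wrapped: overflow interrupt
            interrupts += 1
            counter = preload(period)  # re-arm for the next sampling period
    return interrupts


if __name__ == "__main__":
    print(hex(preload(2_000_000)))
    print(count_interrupts(1000, 10_500))
```

Writing -N into an unsigned counter is the same as writing 2^48 - N, which is why the overflow arrives exactly N events later.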

All modern Intel chips support this, for example Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide

EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine capture profiling data.

So, when you use the hardware PMU, no extra work is done at timer interrupts to read the hardware PMU counters. There is some work to save and restore the PMU state at task switch, but this code (*_sched_in/*_sched_out in kernel/events/core.c) neither changes the PMU counter value for the current thread nor exports it to user space.

There is a handler, x86_pmu_handle_irq in arch/x86/kernel/cpu/perf_event.c, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events: PEBS, perf record -e cycles:pp), stack trace data (if -g was used in record), etc. The handler then resets the counter value to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); note the minus before left).
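
To illustrate how recording the IP at each overflow correlates the counter with code, here is a toy simulation. It is only a sketch under stated assumptions: the trace format, the sample_profile function, and the symbol names "hot" and "cold" are all hypothetical, and a real handler records an instruction address that perf later resolves to a symbol.

```python
from collections import Counter


def sample_profile(trace, period):
    """Attribute one sample to whichever symbol is running at each overflow.

    `trace` is a sequence of (symbol, cycles) slices executed in order.
    """
    samples = Counter()
    left = period                  # events remaining until the next overflow
    for symbol, cycles in trace:
        while cycles >= left:
            cycles -= left
            samples[symbol] += 1   # the "interrupt" records the current symbol
            left = period          # the handler re-arms the counter
        left -= cycles
    return samples


if __name__ == "__main__":
    # Hypothetical workload: 90% of the cycles in "hot", 10% in "cold".
    trace = [("hot", 900), ("cold", 100)] * 100
    profile = sample_profile(trace, period=97)
    total = sum(profile.values())
    for symbol, count in profile.most_common():
        print(f"{symbol}: {count / total:.1%}")
```

The histogram only approximates the true distribution of cycles. A period that divides the loop length evenly (250 in this workload, for example) skews the result, because the samples keep landing on the same positions in the loop.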

The lower the sampling rate, the lower the profiler overhead.

Perf lets you set a target sampling rate with the -F option; -F 1000 means around 1000 irq/s. High rates are not recommended because of the overhead. Ten years ago Intel VTune recommended no more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf: "Try to get about a 1000 samples per second per logical CPU."). Perf usually does not allow too high a rate for non-root users, and it autotunes to a lower rate when "perf interrupt took too long" (check your dmesg). Also check sysctl -a|grep perf, for example kernel.perf_cpu_time_max_percent=25, which means that perf will try to use no more than 25% of CPU time.
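
With -F, the period N is not fixed; perf keeps tuning it so that the observed interrupt rate approaches the target. The sketch below is a simplified model of that feedback loop, not the kernel's actual code: adjust_period is a hypothetical name, and the damping factor of 8 is an assumption used here only to show gradual convergence.

```python
NSEC_PER_SEC = 1_000_000_000


def adjust_period(current_period, events, elapsed_ns, target_hz):
    """Move the sampling period toward the value that yields `target_hz`.

    `events` is how many events were counted during `elapsed_ns`.
    """
    # Period that would have produced exactly target_hz samples per second.
    ideal = events * NSEC_PER_SEC // (elapsed_ns * target_hz)
    # Assumption: take only 1/8 of the correction per step to damp oscillation.
    delta = (ideal - current_period + 7) // 8
    return max(1, current_period + delta)


if __name__ == "__main__":
    period = 1_000_000
    # Hypothetical CPU retiring 3e9 cycles per second, target -F 1000.
    for _ in range(40):
        period = adjust_period(period, 3_000_000_000, NSEC_PER_SEC, 1000)
    print(period)
```

In this model a 3 GHz cycle counter sampled at 1000 Hz settles at a period close to 3,000,000 cycles, and a burst or lull in event rate shifts the period only gradually.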

Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it?

No. But you can enable the tracepoint at sched_switch or another sched event (list all the available ones with perf list 'sched:*') and use it as the profiling event for perf. You can even ask perf to record a stack trace at this tracepoint:

 perf record -a -g -e "sched:sched_switch" sleep 10

Won't that affect the execution of the scheduler

An enabled tracepoint adds some perf event sampling work to the function that contains the tracepoint.

Is the list of task_struct objects available?

Only via ftrace...

Information about context switches

This is a software perf event: just a call to perf_sw_event with the PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). An example of a direct call is the migration software event in kernel/sched/core.c, set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);

PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html
