Using perf_event with the ARM PMU inside gem5

Question

From the gem5 source code and some publications, I know that the ARM PMU is partially implemented in gem5.

I have a binary that uses perf_event to access the PMU on a Linux-based OS running on an ARM processor. Could it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?

So far, I haven't found the right way to do it. If someone knows, I would be very grateful!

Recommended answer

Context

I was not able to use the Performance Monitoring Unit (PMU) because of a feature that was not implemented in gem5; the issue is referenced on the gem5 mailing list. With a personal patch, the PMU becomes accessible through perf_event. Fortunately, a similar patch should be included in an official gem5 release soon. Because of the limit on the number of links in one message, the patch is described in another answer.

This is a minimal working example of C source code using perf_event to count the number of branches mispredicted by the branch predictor unit during a specific task:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read the mispredicted branches counter. */
    static int perf_fd_branch_miss;

    /* Initialize our perf_event_attr, representing one counter to be read.
       Static storage guarantees that all other fields start at zero. */
    static struct perf_event_attr attr_branch_miss;
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, you can do it like this:
     *   attr_branch_miss.type = PERF_TYPE_HARDWARE;
     *   attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
     * On a gem5 system, you have to do it like this: */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;

    /* Open the file descriptor corresponding to this counter. The counter
       should start at this moment. Arguments: pid 0 (calling thread),
       cpu -1 (any CPU), group_fd -1 (no group), flags 0. */
    perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0);
    if (perf_fd_branch_miss == -1) {
        fprintf(stderr, "perf_event_open fail %d: %s\n", errno, strerror(errno));
        return 1;
    }

    /* Workload here, that means our specific task to profile. */

    /* Get and close the performance counter. */
    uint64_t counter_branch_miss = 0;
    if (read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss))
            != (ssize_t)sizeof(counter_branch_miss))
        fprintf(stderr, "read fail %d: %s\n", errno, strerror(errno));
    close(perf_fd_branch_miss);

    /* Display the result. */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);
    return 0;
}

I will not go into the details of how to use perf_event; good resources are available elsewhere, starting with the perf_event_open manual page. However, here are a few notes about the code above:

  • On real hardware, when using perf_event with common events (events available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, together with macros such as PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is the best practice for portable code.
  • On a gem5 simulated system, currently (v20.0), C source code has to use the PERF_TYPE_RAW type and an architectural event ID to identify an event. Here, 0x10 is the ID of the event 0x0010, BR_MIS_PRED, "Mispredicted or not predicted branch", described in the ARMv8-A Reference Manual. The manual describes all events available in real hardware, but not all of them are implemented in gem5. To see the list of events implemented in gem5, refer to the file src/arch/arm/ArmPMU.py. In that file, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. Requiring raw IDs is not normal behavior, so gem5 should be patched to allow PERF_TYPE_HARDWARE one day.
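
The two notes above amount to choosing a (type, config) pair depending on where the binary runs. A minimal sketch of that choice follows, written in Python only for brevity. The numeric constants are copied from linux/perf_event.h, 0x10 is the BR_MIS_PRED ID quoted above, and the names GENERIC_TO_ARM_RAW and select_event are hypothetical helpers, not part of perf_event or gem5.

```python
# Values copied from the Linux UAPI header linux/perf_event.h.
PERF_TYPE_HARDWARE = 0
PERF_TYPE_RAW = 4
PERF_COUNT_HW_BRANCH_MISSES = 5

# ARMv8-A architectural event number quoted in the notes above.
ARMV8_BR_MIS_PRED = 0x10

# Hypothetical table: generic perf event -> raw ARM event number.
# Only the entry discussed in the text is filled in.
GENERIC_TO_ARM_RAW = {
    PERF_COUNT_HW_BRANCH_MISSES: ARMV8_BR_MIS_PRED,
}


def select_event(generic_event, on_gem5):
    """Return the (type, config) pair to store in perf_event_attr."""
    if not on_gem5:
        # Portable path: let the kernel map the generic event.
        return PERF_TYPE_HARDWARE, generic_event
    if generic_event not in GENERIC_TO_ARM_RAW:
        raise ValueError("no raw ARM event known for %d" % generic_event)
    # gem5 (v20.0) path: identify the event by its architectural ID.
    return PERF_TYPE_RAW, GENERIC_TO_ARM_RAW[generic_event]


if __name__ == "__main__":
    print(select_event(PERF_COUNT_HW_BRANCH_MISSES, on_gem5=True))
    print(select_event(PERF_COUNT_HW_BRANCH_MISSES, on_gem5=False))
```

Keeping the mapping in one table means that only the table has to be dropped if gem5 later accepts PERF_TYPE_HARDWARE.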

This is not a complete MWE script (it would be too long!), only the portion that must be added to a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.

For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the already implemented architectural events. An example of such a function can be found in configs/example/arm/devices.py.

To choose an interrupt number, pick a free PPI in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPIs are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To learn more about PPIs, see the GIC guide from ARM.

Here, we can see that interrupt n°20 is not used by the system (from RealView.py):

Interrupts:
      0- 15: Software generated interrupts (SGIs)
     16- 31: On-chip private peripherals (PPIs)
        25   : vgic
        26   : generic_timer (hyp)
        27   : generic_timer (virt)
        28   : Reserved (Legacy FIQ)
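
The check described above can be sketched in a few lines of Python. This is only an illustration: the used entries are copied from the excerpt shown here, which may not be the complete map for every gem5 version, and the names USED_PPIS, free_ppis and check_ppi are hypothetical.

```python
# PPI interrupt IDs span 16-31 according to the excerpt above.
PPI_RANGE = range(16, 32)

# PPIs that the excerpt lists as already assigned (assumption: the
# excerpt is complete for the platform being simulated).
USED_PPIS = {
    25: "vgic",
    26: "generic_timer (hyp)",
    27: "generic_timer (virt)",
    28: "Reserved (Legacy FIQ)",
}


def free_ppis(used=USED_PPIS):
    """Return the PPI numbers that the map does not assign."""
    return [num for num in PPI_RANGE if num not in used]


def check_ppi(num, used=USED_PPIS):
    """Return num if it is a free PPI, otherwise raise ValueError."""
    if num not in PPI_RANGE:
        raise ValueError("%d is not a PPI (expected 16-31)" % num)
    if num in used:
        raise ValueError("PPI %d is already used by %s" % (num, used[num]))
    return num


if __name__ == "__main__":
    print("free PPIs:", free_ppis())
    print("chosen PPI:", check_ppi(20))
```

Running such a check before building the system catches a clash with the timers or the vgic early, rather than as a silent misbehavior in the simulated kernel.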

We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them; the PMU then uses the internal counters (called probes) of these components as the counters exposed to the system.

for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the architectural events implemented in gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))
