Using perf_event with the ARM PMU inside gem5


Question

I know from the gem5 source code and some publications that the ARM PMU is partially implemented.

I have a binary that uses perf_event to access the PMU on a Linux-based OS running on an ARM processor. Can it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?

So far, I haven't found the right way to do it. If someone knows, I would be very grateful!

Recommended answer

Context

I was not able to use the Performance Monitoring Unit (PMU) because of a feature that gem5 had not implemented; the issue was discussed on the gem5 mailing list. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch is due to be released in the official gem5 release soon. The patch is described in another answer, because of the limit on the number of links in one message.

This is a minimal working example of C source code using perf_event. It counts the number of branches mispredicted by the branch predictor unit during a specific task:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    /* File descriptor used to read the mispredicted branches counter. */
    int perf_fd_branch_miss;

    /* Initialize our perf_event_attr, representing one counter to be read.
       All fields that are not set explicitly must be zero. */
    struct perf_event_attr attr_branch_miss;
    memset(&attr_branch_miss, 0, sizeof(attr_branch_miss));
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, you can identify the event like this: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, you have to identify it like this instead. These
       two lines override the two above; keep only the pair you need. */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;

    /* Open the file descriptor corresponding to this counter (calling
       thread, any CPU, no group). The counter starts at this moment. */
    perf_fd_branch_miss = (int)syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0);
    if (perf_fd_branch_miss == -1) {
        fprintf(stderr, "perf_event_open fail %d: %s\n", errno, strerror(errno));
        return 1;
    }

    /* Workload here, that means our specific task to profile. */

    /* Get and close the performance counter. */
    uint64_t counter_branch_miss = 0;
    if (read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss))
            != (ssize_t)sizeof(counter_branch_miss))
        fprintf(stderr, "read fail %d: %s\n", errno, strerror(errno));
    close(perf_fd_branch_miss);

    /* Display the result. The counter is 64-bit, so use PRIu64, not %d. */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);
    return 0;
}

I will not go into the details of how to use perf_event, since good resources are available elsewhere. However, here are a few notes about the code above:

  • On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, and macros such as PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is the best practice for portable code.
  • On a gem5 simulated system, currently (v20.0), C source code has to use the PERF_TYPE_RAW type and the architectural event ID to identify an event. Here, 0x10 is the ID of the event 0x0010, BR_MIS_PRED, "Mispredicted or not predicted branch", described in the ARMv8-A Reference Manual. The manual describes all events available in real hardware, but not all of them are implemented in gem5. To see the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file. There, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. Having to use raw IDs is not normal behavior, so gem5 should eventually be patched to allow PERF_TYPE_HARDWARE.
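The list of implemented events can also be pulled out of ArmPMU.py mechanically. The following is a minimal Python sketch, assuming the declarations keep the shape of the line quoted above (an event class constructed with self and a hexadecimal ID inside addEvent). The helper name list_pmu_event_ids and the regular expression are illustrative and not part of gem5; declarations written in another form would be missed.

```python
import re

# Matches declarations shaped like the one quoted in the text, e.g.
#   self.addEvent(ProbeEvent(self,0x10, bpred, "Misses"))
# and captures the event class name and the hexadecimal event ID.
_ADD_EVENT = re.compile(
    r"addEvent\(\s*(\w+)\(\s*self\s*,\s*(0[xX][0-9a-fA-F]+)")


def list_pmu_event_ids(source_text):
    """Return sorted, de-duplicated (event_id, event_class) pairs."""
    found = {(int(hex_id, 16), cls)
             for cls, hex_id in _ADD_EVENT.findall(source_text)}
    return sorted(found)


# The single declaration quoted in the text, used here as sample input.
sample = 'self.addEvent(ProbeEvent(self,0x10, bpred, "Misses"))'
print(list_pmu_event_ids(sample))  # prints [(16, 'ProbeEvent')]
```

To scan a real tree, pass the contents of src/arch/arm/ArmPMU.py to the function, for example list_pmu_event_ids(open(path).read()). Each returned ID is a value that could go into attr.config together with PERF_TYPE_RAW.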

This is not a complete MWE script (it would be too long!), only the portion that must be added to a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.

For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the architectural events that are already implemented. An example of this function can be found in configs/example/arm/devices.py.

To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPI interrupts are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To know more about PPI interrupts, see the GIC guide from ARM.

Here, we can see that interrupt n°20 is not used by the system (from RealView.py):

Interrupts:
      0- 15: Software generated interrupts (SGIs)
     16- 31: On-chip private peripherals (PPIs)
        25   : vgic
        26   : generic_timer (hyp)
        27   : generic_timer (virt)
        28   : Reserved (Legacy FIQ)
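The same check can be written as a small Python sketch. The table only transcribes the PPI entries of the listing above, and the names USED_PPIS, PPI_RANGE and is_free_ppi are illustrative, not gem5 APIs. Another platform or gem5 version may assign other numbers, so the table should be rebuilt from the RealView.py actually in use.

```python
# PPI entries transcribed from the RealView.py interrupt listing above.
USED_PPIS = {
    25: "vgic",
    26: "generic_timer (hyp)",
    27: "generic_timer (virt)",
    28: "Reserved (Legacy FIQ)",
}

# 16-31: on-chip private peripherals (PPIs), as stated in the listing.
PPI_RANGE = range(16, 32)


def is_free_ppi(num, used=USED_PPIS):
    """True if num is in the PPI range and the listing does not assign it."""
    return num in PPI_RANGE and num not in used


print(is_free_ppi(20))  # prints True: in the PPI range and unassigned
print(is_free_ppi(27))  # prints False: taken by generic_timer (virt)
print(is_free_ppi(40))  # prints False: outside the PPI range
```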

We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them. The PMU then uses the internal counters of these components (called probes) as the counters it exposes to the system.

from m5.objects import ArmPMU, ArmPPI

# "system" is the ArmSystem built earlier in the full-system script.
for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        # PPIs are private to each core, so every PMU can use number 20.
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the implemented architectural events of gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))
