PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring

Question

    I'm working on a custom implementation on top of the perf_event_open syscall.

    The implementation aims to support various PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE and PERF_TYPE_HW_CACHE events for specific threads on any core.

    In the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, I see the following for my test CPU (Kaby Lake):

    [Figure: excerpt from the manual showing the number of PMU counters; image not preserved]

    To my understanding so far, one can monitor a (theoretically) unlimited number of PERF_TYPE_SOFTWARE events concurrently, but only a limited number of PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events without multiplexing, since each of these events is measured by one of the limited number of counters in the CPU's PMU (as can be seen in the manual excerpt above).

    So for a quad-core Kaby Lake CPU with HyperThreading enabled, I assume that up to 4 PERF_TYPE_HARDWARE/PERF_TYPE_HW_CACHE events can be monitored concurrently (or up to 8 if only 4 threads are used).

    Experimenting with these assumptions, I found that while I can successfully monitor up to 4 PERF_TYPE_HARDWARE events (for 8 threads), this is not the case for PERF_TYPE_HW_CACHE events, where only up to 2 events can be monitored concurrently!

    I also tried using only 4 threads, but the upper limit of concurrently monitored PERF_TYPE_HARDWARE events remains 4. The same happens with HyperThreading disabled!

    One could ask why multiplexing needs to be avoided. First, the implementation needs to be as accurate as possible, which means avoiding the potential blind spots of multiplexing. Second, when the "upper limit" is exceeded, all event values are 0...

    The PERF_TYPE_HW_CACHE events I'm targeting are:

    CACHE_LLC_READ(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
    CACHE_LLC_WRITE(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
    CACHE_LLC_READ_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
    CACHE_LLC_WRITE_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value  | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
    

    All of them are built with the provided formula:

    (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
    (perf_hw_cache_op_result_id << 16)
    

    and they are handled as a group (the first is the group leader, and so on).
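
As a minimal sketch of the formula above, the four LLC config values can be computed as follows. The numeric enum values are copied from `<linux/perf_event.h>`; the names `CacheId`, `CacheOp`, `CacheResult`, `hw_cache_config` and `LLC_EVENTS` are hypothetical helpers introduced only for this illustration and are not part of the asker's implementation.

```python
from enum import IntEnum


class CacheId(IntEnum):
    """Mirrors enum perf_hw_cache_id in <linux/perf_event.h>."""
    L1D = 0
    L1I = 1
    LL = 2
    DTLB = 3
    ITLB = 4
    BPU = 5
    NODE = 6


class CacheOp(IntEnum):
    """Mirrors enum perf_hw_cache_op_id."""
    READ = 0
    WRITE = 1
    PREFETCH = 2


class CacheResult(IntEnum):
    """Mirrors enum perf_hw_cache_op_result_id."""
    ACCESS = 0
    MISS = 1


PERF_TYPE_HW_CACHE = 3  # value of PERF_TYPE_HW_CACHE in enum perf_type_id


def hw_cache_config(cache: CacheId, op: CacheOp, result: CacheResult) -> int:
    """Return perf_event_attr.config for a PERF_TYPE_HW_CACHE event."""
    return int(cache) | (int(op) << 8) | (int(result) << 16)


# The four LLC events listed in the question (hypothetical names).
LLC_EVENTS = {
    "CACHE_LLC_READ": hw_cache_config(CacheId.LL, CacheOp.READ, CacheResult.ACCESS),
    "CACHE_LLC_WRITE": hw_cache_config(CacheId.LL, CacheOp.WRITE, CacheResult.ACCESS),
    "CACHE_LLC_READ_MISS": hw_cache_config(CacheId.LL, CacheOp.READ, CacheResult.MISS),
    "CACHE_LLC_WRITE_MISS": hw_cache_config(CacheId.LL, CacheOp.WRITE, CacheResult.MISS),
}

if __name__ == "__main__":
    for name, config in LLC_EVENTS.items():
        print(f"{name}: type={PERF_TYPE_HW_CACHE} config=0x{config:05x}")
```

Each value goes into `perf_event_attr.config` together with `type = PERF_TYPE_HW_CACHE`; opening the events and wiring up the group leader is left out of this sketch.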

    So, my questions are the following:

    1. Which counters of the PMU are used for PERF_TYPE_HARDWARE and which for PERF_TYPE_HW_CACHE events and where can I find this information?
    2. What is the difference between the PERF_TYPE_HARDWARE pre-defined events (such as PERF_COUNT_HW_CACHE_MISSES) and the PERF_TYPE_HW_CACHE events?
    3. Any suggestions on how to monitor all the listed PERF_TYPE_HW_CACHE events without multiplexing?
    4. Any suggestions on how to monitor up to 8 PERF_TYPE_HARDWARE and/or PERF_TYPE_HW_CACHE events without multiplexing?

    Thanks in advance!

    Solution

    1. The PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to two sets of registers involved in performance monitoring. The first set of MSRs is called IA32_PERFEVTSELx, where x ranges from 0 to N-1, N being the total number of general-purpose counters available. PERFEVTSEL is short for "performance event select"; these registers specify the conditions under which an event is counted. The second set of MSRs is called IA32_PMCx, where x ranges as for PERFEVTSEL. These PMC registers store the counts of performance monitoring events. Each PERFEVTSEL register is paired with a corresponding PMC register.

    The mapping happens as follows:
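
To make the PERFEVTSEL/PMC pairing concrete, here is a rough sketch of how a select value is laid out and which MSR addresses form a pair. The base addresses (0x186 and 0x0C1) and the bit positions follow the Intel SDM; the helper names `counter_msrs` and `perfevtsel` are hypothetical, and the event/umask pair in the usage line is the architectural LLC-misses event, chosen only as an example.

```python
IA32_PERFEVTSEL0 = 0x186  # first event-select MSR
IA32_PMC0 = 0x0C1         # first general-purpose counter MSR

USR_BIT = 1 << 16  # count while in user mode (ring 3)
OS_BIT = 1 << 17   # count while in kernel mode (ring 0)
EN_BIT = 1 << 22   # enable the paired counter


def counter_msrs(index: int) -> tuple[int, int]:
    """Return the (IA32_PERFEVTSELx, IA32_PMCx) addresses for counter x."""
    return IA32_PERFEVTSEL0 + index, IA32_PMC0 + index


def perfevtsel(event: int, umask: int, usr: bool = True,
               os: bool = True, enable: bool = True) -> int:
    """Encode an event-select value: event in bits 0-7, umask in bits 8-15."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= USR_BIT
    if os:
        value |= OS_BIT
    if enable:
        value |= EN_BIT
    return value


if __name__ == "__main__":
    sel_addr, pmc_addr = counter_msrs(0)
    # Architectural "LLC misses" event: event select 0x2E, umask 0x41.
    print(hex(sel_addr), hex(pmc_addr), hex(perfevtsel(0x2E, 0x41)))
```

The kernel writes these registers itself; the sketch only shows the encoding and is not meant for writing MSRs from user space.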

    During initialization of the architecture-specific part of the kernel, a PMU for measuring hardware-specific events is registered with type PERF_TYPE_RAW. All PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to PERF_TYPE_RAW events in order to identify the PMU:

    if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
            type = PERF_TYPE_RAW;
    

    The same architecture-specific initialization also sets up the addresses of the first (base) register of each of the two sets of performance monitoring registers mentioned above:

        .eventsel       = MSR_ARCH_PERFMON_EVENTSEL0,
        .perfctr        = MSR_ARCH_PERFMON_PERFCTR0,
    

    The PMU-specific event_init function is responsible for setting up and "reserving" the two sets of performance monitoring registers, as well as for checking event constraints and so on. The reservation looks like this:

    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
            goto perfctr_fail;
    }

    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
            goto eventsel_fail;
    }
    

    Here num_counters is the number of general-purpose counters, as reported by the CPUID instruction.
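
As an illustration of where that number comes from, the sketch below decodes the EAX register of CPUID leaf 0AH (architectural performance monitoring), using the field layout from the Intel SDM. The sample EAX value is an assumption chosen to represent a part with 4 general-purpose counters per logical processor; it was not read from the machine discussed here, and `PerfmonInfo` and `decode_cpuid_0a_eax` are hypothetical names.

```python
from typing import NamedTuple


class PerfmonInfo(NamedTuple):
    version: int      # architectural perfmon version ID (EAX bits 7:0)
    gp_counters: int  # general-purpose counters per logical CPU (bits 15:8)
    gp_width: int     # bit width of each general-purpose counter (bits 23:16)


def decode_cpuid_0a_eax(eax: int) -> PerfmonInfo:
    """Split CPUID.0AH:EAX into the fields that describe the counters."""
    return PerfmonInfo(
        version=eax & 0xFF,
        gp_counters=(eax >> 8) & 0xFF,
        gp_width=(eax >> 16) & 0xFF,
    )


if __name__ == "__main__":
    sample_eax = 0x07300404  # assumed example value, not measured
    print(decode_cpuid_0a_eax(sample_eax))
```

On a real system the raw register can be obtained with a CPUID utility or a small C program using the `cpuid` instruction; this sketch covers only the decoding step.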

    In addition, there are a couple of extra registers that monitor offcore events (e.g. the LLC-specific events).

    In later versions of architectural performance monitoring, some of the hardware events are measured with the help of fixed-purpose registers. These are the fixed-purpose registers:

    #define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
    #define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
    #define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
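
The following sketch shows which generic PERF_TYPE_HARDWARE events these three fixed counters can serve, and how many general-purpose counters a set of events would need in the best case. The event IDs come from `enum perf_hw_id` in `<linux/perf_event.h>` and the event-to-counter pairing follows the Intel SDM; `FIXED_COUNTERS` and `gp_counters_needed` are hypothetical names, and the real kernel scheduler applies more constraints than this estimate does.

```python
# Generic hardware event IDs from enum perf_hw_id.
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_COUNT_HW_CACHE_REFERENCES = 2
PERF_COUNT_HW_CACHE_MISSES = 3
PERF_COUNT_HW_REF_CPU_CYCLES = 9

# Event ID -> (fixed counter name, MSR address).
FIXED_COUNTERS = {
    PERF_COUNT_HW_INSTRUCTIONS: ("MSR_ARCH_PERFMON_FIXED_CTR0", 0x309),
    PERF_COUNT_HW_CPU_CYCLES: ("MSR_ARCH_PERFMON_FIXED_CTR1", 0x30A),
    PERF_COUNT_HW_REF_CPU_CYCLES: ("MSR_ARCH_PERFMON_FIXED_CTR2", 0x30B),
}


def gp_counters_needed(hw_event_ids: list[int]) -> int:
    """Best-case number of general-purpose counters for distinct events.

    Assumes every event that has a fixed counter is actually scheduled on
    it, which the kernel may not do in every situation.
    """
    return sum(1 for event in hw_event_ids if event not in FIXED_COUNTERS)


if __name__ == "__main__":
    wanted = [
        PERF_COUNT_HW_CPU_CYCLES,
        PERF_COUNT_HW_INSTRUCTIONS,
        PERF_COUNT_HW_CACHE_REFERENCES,
        PERF_COUNT_HW_CACHE_MISSES,
    ]
    print(gp_counters_needed(wanted))
```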
    

    2. The PERF_TYPE_HARDWARE predefined events are all architectural performance monitoring events. They are called architectural because the behavior of each such event is expected to be consistent on all processors that support it. All of the PERF_TYPE_HW_CACHE events are non-architectural, which means they are model-specific and may vary from one processor family to another.

    3. On the Intel Kaby Lake machine that I have, a total of 20 PERF_TYPE_HW_CACHE events are predefined. The event constraints ensure that the 3 available fixed-function counters are mapped to 3 PERF_TYPE_HARDWARE architectural events. Only one event can be measured on each fixed-function counter, so we can leave them out of this analysis. The other constraint is that only two events targeting the LLC can be measured at the same time, since there are only two OFFCORE RESPONSE registers. In addition, the NMI watchdog may pin an event to one of the general-purpose counters. If the NMI watchdog is disabled, we are left with 4 general-purpose counters.

    Given these constraints and the limited number of counters available, there is no way to avoid multiplexing if all 20 hardware cache events are measured at the same time. Some workarounds for measuring all the events without incurring multiplexing and its errors are:

    3.1. Split the PERF_TYPE_HW_CACHE events into groups of 4, so that the 4 events of a group can be scheduled on the 4 general-purpose counters at the same time. Make sure there are no more than 2 LLC events in a group. Run the same workload once per group and collect the counts for each group separately.

    3.2. If all the PERF_TYPE_HW_CACHE events have to be monitored at the same time, some of the multiplexing error can be reduced by decreasing the value of perf_event_mux_interval_ms. It can be configured via the sysfs entry /sys/devices/cpu/perf_event_mux_interval_ms. This value cannot be lowered below a certain point.
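
The grouping described in 3.1 can be sketched as a simple first-fit pass. The limits of 4 events per group and 2 LLC events per group are taken from the answer; the function name `split_into_groups`, the `is_llc` callback and the event names in the usage example are hypothetical and only illustrate the idea.

```python
from typing import Callable


def split_into_groups(
    events: list[str],
    is_llc: Callable[[str], bool],
    max_per_group: int = 4,
    max_llc_per_group: int = 2,
) -> list[list[str]]:
    """Place each event in the first group that still has room for it.

    A group has room if it holds fewer than max_per_group events and,
    for an LLC event, fewer than max_llc_per_group LLC events.
    """
    groups: list[list[str]] = []
    for event in events:
        for group in groups:
            llc_in_group = sum(1 for member in group if is_llc(member))
            fits_size = len(group) < max_per_group
            fits_llc = not is_llc(event) or llc_in_group < max_llc_per_group
            if fits_size and fits_llc:
                group.append(event)
                break
        else:
            # No existing group can take the event, so start a new one.
            groups.append([event])
    return groups


if __name__ == "__main__":
    names = [
        "LLC-loads", "LLC-load-misses", "LLC-stores", "LLC-store-misses",
        "L1-dcache-loads", "L1-dcache-load-misses",
    ]
    for group in split_into_groups(names, lambda n: n.startswith("LLC")):
        print(group)
```

Each resulting group would then be opened as its own perf event group and measured in a separate run of the workload, so counts from different groups come from different runs and are only comparable if the workload is repeatable.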

    4. Monitoring up to 8 hardware or hardware-cache events would require hyperthreading to be disabled. Note that the number of general-purpose counters available is retrieved using the CPUID instruction, and that this number is set up during the architecture initialization part of kernel startup via an early_initcall function. Once the initialization is done, the kernel understands that only 4 counters are available, and any later change in hyperthreading capabilities makes no difference.
