How to Configure and Sample Intel Performance Counters In-Process
Question
In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
    pctr_start = sample_pctr();
    the_benchmark();
    pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (in an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
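For illustration, here is a rough sketch of what the RDPMC read might look like with GCC/Clang inline assembly. The names read_pmc() and fixed_counter_selector() are hypothetical helpers, not part of any library. The sketch assumes x86_64, that the counter has already been configured and enabled, and that the kernel permits RDPMC from user space (CR4.PCE set); otherwise the instruction faults.

```c
#include <stdint.h>

/* Hypothetical helper: build the RDPMC selector (the value loaded into ECX)
 * for fixed-function counter `n`. Bit 30 selects the fixed-function
 * counters rather than the general-purpose ones. */
static inline uint32_t fixed_counter_selector(uint32_t n)
{
    return (UINT32_C(1) << 30) | n;
}

#if defined(__x86_64__)
/* Hypothetical helper: read the counter chosen by `selector`.
 * RDPMC takes the selector in ECX and returns the count in EDX:EAX.
 * Assumption: the counter is already programmed and user-space RDPMC
 * is allowed; if not, this raises a general-protection fault. */
static inline uint64_t read_pmc(uint32_t selector)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

With something like this in place, sample_pctr() would reduce to a call such as read_pmc(fixed_counter_selector(n)), where n is whichever fixed-function counter carries the event of interest. Note that RDPMC is not a serializing instruction, so neighbouring instructions may be reordered around the read.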
I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
Does anyone know a lightweight way to ask the kernel to configure a performance counter from user space so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
- Perf tools for Linux. Seems to be geared towards sampling over the whole lifetime of a process, not at specific points within a process (before and after each iteration).
- Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
- PAPI builds on perf, so probably inherits the above problem.
- Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.
Recommended answer
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
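The open/seek/read-or-write sequence can be sketched roughly as below. The function names msr_open(), msr_read() and msr_write() are made up for this example. The sketch assumes the Linux msr driver is loaded and exposes /dev/cpu/<n>/msr, and that the process has sufficient privileges (typically root) to open it.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open the msr device node for one logical CPU.
 * Returns a file descriptor, or -1 on failure (errno set by open). */
int msr_open(int cpu)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    return open(path, O_RDWR);
}

/* Seek to the MSR address, then read exactly 8 bytes.
 * Returns 0 on success, -1 on a failed seek or a short read. */
int msr_read(int fd, uint32_t addr, uint64_t *value)
{
    if (lseek(fd, (off_t)addr, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, value, sizeof *value) == (ssize_t)sizeof *value ? 0 : -1;
}

/* Seek to the MSR address, then write exactly 8 bytes.
 * Returns 0 on success, -1 on a failed seek or a short write. */
int msr_write(int fd, uint32_t addr, uint64_t value)
{
    if (lseek(fd, (off_t)addr, SEEK_SET) == (off_t)-1)
        return -1;
    return write(fd, &value, sizeof value) == (ssize_t)sizeof value ? 0 : -1;
}
```

A benchmark harness would call msr_open() for the CPU it is pinned to, use msr_write() to program the counter control MSRs, and close the descriptor when done. Each device node addresses the MSRs of a single logical CPU, so the process should stay pinned to that CPU while measuring.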
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.