Good resources on how to program PEBS (Precise Event-Based Sampling) counters?

Published: 2020/5/8 19:02:42. Tags: performance, memory, cpu, processor, perf


Question

I have been trying to log all memory accesses of a program, which, from what I have read, seems to be impossible. So I have been trying to see how far I can go toward logging at least a major portion of the memory accesses, if not all. I was looking to program the PEBS counters in such a way that I could see changes in the number of memory access samples collected. I wanted to know if I can do this by modifying the counter-reset value of the PEBS counters. (Usually this goes to zero, but I want to set it to a higher value.)

So I was looking to program these PEBS counters on my own. Has anybody had experience manipulating the PEBS counters? Specifically, I was looking for good sources that show how to program them. I have gone through the Intel documentation and understood the steps, but I wanted to study some sample programs. I have gone through the GitHub repo below:

https://github.com/pyrovski/powertools

But I am not quite sure how and where to start. Are there any other good sources that I should look at? Any suggestions for good resources to understand PEBS and start programming it would be very helpful.

Recommended answer

Please don't mix tracing and timing measurements in a single run.

It is simply impossible to have both the fastest run of SPEC and all memory accesses traced. Do one run for timing and another (longer, slower) run for memory access tracing.

In https://github.com/pyrovski/powertools, the frequency of collected events is controlled by the reset_val argument of pebs_init:

https://github.com/pyrovski/powertools/blob/0f66c5f3939a9b7b88ec73f140f1a0892cfba235/msr_pebs.c#L72

void
pebs_init(int nRecords, uint64_t *counter, uint64_t *reset_val ){
    // 1. Set up the precise event buffering utilities.
    //  a.  Place values in the
    //      i.   precise event buffer base,
    //      ii.  precise event index
    //      iii. precise event absolute maximum,
    //      iv.  precise event interrupt threshold,
    //      v.   and precise event counter reset fields
    //      of the DS buffer management area.
    //
    // 2.  Enable PEBS.  Set the Enable PEBS on PMC0 flag 
    //  (bit 0) in IA32_PEBS_ENABLE_MSR.
    //
    // 3.  Set up the IA32_PMC0 performance counter and 
    //  IA32_PERFEVTSEL0 for an event listed in Table 
    //  18-10.

    // IA32_DS_AREA points to 0x58 bytes of memory.  
    // (11 entries * 8 bytes each = 88 bytes.)

    // Each PEBS record is 0xB0 bytes long.
...
    pds_area->pebs_counter0_reset       = reset_val[0];
    pds_area->pebs_counter1_reset       = reset_val[1];
    pds_area->pebs_counter2_reset       = reset_val[2];
    pds_area->pebs_counter3_reset       = reset_val[3];
...

    write_msr(0, PMC0, reset_val[0]);
    write_msr(1, PMC1, reset_val[1]);
    write_msr(2, PMC2, reset_val[2]);
    write_msr(3, PMC3, reset_val[3]);
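As a rough illustration of step 1 in the comment above, the 64-bit DS buffer management area can be sketched as a C struct. The pebs_counterN_reset names follow the pds_area accesses in the excerpt. The other field names, the helper ds_pebs_setup, and the offsets are assumptions based on the 64-bit DS save area layout in the Intel SDM as I understand it, not code taken from powertools. The 0xB0 record size comes from the comment in the excerpt. Under this layout, with four counter-reset fields, the last field starts at offset 0x58 and the whole area is 0x60 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PEBS_RECORD_SIZE 0xB0u  /* record size quoted in the excerpt's comment */

/* Sketch of the 64-bit DS buffer management area (names partly hypothetical). */
struct ds_area {
    uint64_t bts_buffer_base;
    uint64_t bts_index;
    uint64_t bts_absolute_maximum;
    uint64_t bts_interrupt_threshold;
    uint64_t pebs_buffer_base;
    uint64_t pebs_index;
    uint64_t pebs_absolute_maximum;
    uint64_t pebs_interrupt_threshold;
    uint64_t pebs_counter0_reset;
    uint64_t pebs_counter1_reset;
    uint64_t pebs_counter2_reset;
    uint64_t pebs_counter3_reset;
};

/* Fill in the PEBS fields for a buffer that holds n_records records.
 * The threshold address is reached once threshold_records records
 * have been written, which is when the interrupt is requested. */
void ds_pebs_setup(struct ds_area *ds, uint64_t buffer_addr,
                   uint64_t n_records, uint64_t threshold_records,
                   const uint64_t reset_val[4])
{
    assert(ds != NULL);
    assert(threshold_records >= 1 && threshold_records <= n_records);

    ds->pebs_buffer_base         = buffer_addr;
    ds->pebs_index               = buffer_addr;  /* next record is written here */
    ds->pebs_absolute_maximum    = buffer_addr + n_records * PEBS_RECORD_SIZE;
    ds->pebs_interrupt_threshold = buffer_addr + threshold_records * PEBS_RECORD_SIZE;

    ds->pebs_counter0_reset = reset_val[0];
    ds->pebs_counter1_reset = reset_val[1];
    ds->pebs_counter2_reset = reset_val[2];
    ds->pebs_counter3_reset = reset_val[3];
}
```

In a real setup the buffer address would be the address of pinned kernel-visible memory and the address of the struct itself would go into IA32_DS_AREA; both steps need ring 0 (or the msr driver) and are not shown here.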

This project is a library for accessing PEBS, and the project includes no examples of its usage (as far as I found, there is only one disabled test in other projects by tpatki).

Check the Intel SDM, Volume 3B (this is the only good resource for PEBS programming) for the meaning of the fields and for PEBS configuration and output: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-734.html

18.15.7 Processor Event-Based Sampling

PEBS permits the saving of precise architectural information associated with one or more performance events in the precise event records buffer, which is part of the DS save area (see Section 17.4.9, "BTS and DS Save Area"). To use this mechanism, a counter is configured to overflow after it has counted a preset number of events. After the counter overflows, the processor copies the current state of the general-purpose and EFLAGS registers and instruction pointer into a record in the precise event records buffer. The processor then resets the count in the performance counter and restarts the counter. When the precise event records buffer is nearly full, an interrupt is generated, allowing the precise event records to be saved. A circular buffer is not supported for precise event records. ... After the PEBS-enabled counter has overflowed, PEBS record is recorded

(So the reset value is probably negative: -1000 to get every 1000th event, -10 to get every 10th event. The counter counts up, and a PEBS record is written when the counter overflows.)
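To make the "negative reset value" idea concrete, here is a minimal sketch that turns a sampling period into the value a counter would start from. The function name pebs_reset_value is hypothetical, and the 48-bit width used in the example is an assumption; the actual general-purpose counter width should be read from CPUID leaf 0AH.

```c
#include <assert.h>
#include <stdint.h>

/* Value to load into a counter (and into its DS reset field) so that the
 * counter overflows after `period` events. `width` is the counter width
 * in bits; 48 is only an assumption, check CPUID leaf 0AH on the target. */
uint64_t pebs_reset_value(uint64_t period, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    assert(period >= 1 && period <= mask);

    /* Two's complement of the period, truncated to the counter width. */
    return (UINT64_C(0) - period) & mask;
}
```

With a 48-bit counter, pebs_reset_value(1000, 48) gives 0xFFFFFFFFFC18, which is -1000 truncated to 48 bits, so the counter overflows on the 1000th event after each reset.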

See also https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-656.html, section 18.4.4 "Processor Event Based Sampling (PEBS)", Table 18-10: on Intel Core, only L1/L2/DTLB misses have PEBS events. (Find the PEBS section for your CPU and search for memory events. PEBS-capable events are really rare.)

So, to have more events recorded, you probably want to set the reset value passed to this function to a smaller absolute value, such as -50 or -10. With PEBS this may work. Also try perf record -e cycles:upp -c 10: don't ask to profile the kernel at such a high frequency, only user space with :u; ask for precise sampling with :pp; and ask for a counter reset of -10 with -c 10. perf has all the PEBS mechanics implemented, both for the MSRs and for buffer parsing.
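For a program that wants to request the same kind of sampling itself instead of going through the perf tool, the Linux perf_event_open(2) interface takes the equivalent settings in struct perf_event_attr. The sketch below only shows how the attribute could be filled in for user-space precise cycle sampling; the helper names are hypothetical, and mapping the ring buffer and parsing samples are left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill attr roughly the way `cycles:upp -c period` is requested from perf. */
void fill_precise_cycles_attr(struct perf_event_attr *attr,
                              unsigned long long period)
{
    assert(attr != NULL && period > 0);
    memset(attr, 0, sizeof(*attr));
    attr->size           = sizeof(*attr);
    attr->type           = PERF_TYPE_HARDWARE;
    attr->config         = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period  = period;          /* like -c period             */
    attr->sample_type    = PERF_SAMPLE_IP;  /* record the instruction ptr */
    attr->precise_ip     = 2;               /* like :pp                   */
    attr->exclude_kernel = 1;               /* like :u                    */
    attr->exclude_hv     = 1;
    attr->disabled       = 1;               /* enable later with ioctl    */
}

/* glibc provides no wrapper for perf_event_open, so use syscall(2).
 * Returns a file descriptor, or -1 with errno set. */
long open_perf_event(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

A caller would typically pass pid 0 and cpu -1 to measure the calling thread on any CPU; whether the call succeeds depends on the kernel's perf_event_paranoid setting and on the CPU supporting the requested precise level.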

Another good resource on the PMU (hardware performance monitoring unit) also comes from Intel: the PMU Programming Guides. They give a short, compact description of both the regular PMU and PEBS. There is a public "Nehalem Core PMU" guide, most of which is still useful for newer CPUs: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf (There are also uncore PMU guides, e.g. the E5-2600 Uncore PMU Guide, 2012: https://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)

An external PDF about PEBS: https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf#page=23 ("PMCs: Setting Up for PEBS", from "Black Hat USA 2015 - These are Not Your Grand Daddy's CPU Performance Counters")

You may start with a short and simple program (not the ref inputs of recent SpecCPU) and use the Linux perf tool (perf_events) to find an acceptable ratio of recorded memory requests to all memory requests. PEBS is used with perf by adding the :p or :pp suffix to the event specifier: perf record -e event:pp. Also try ocperf.py from pmu-tools for easier encoding of Intel event names.

Try to find the real (maximum) overhead at different recording ratios (1% / 10% / 50%) on memory tests such as the ones below. These are the worst case for memory recording overhead: they sit at the left end of the arithmetic intensity scale of the Roofline model (STREAM is BLAS1; GUPS and memlat are almost SpMV), and real tasks are usually not that far left on the scale:

STREAM test (linear access to memory),
RandomAccess (GUPS) test,
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
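If you want a tiny workload of your own before running the real benchmarks, a triad loop has the same linear access pattern as STREAM. This is only a sketch of that access pattern, not the official STREAM benchmark, and the function name triad is my own.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + scalar * c[i].
 * One linear pass: two loads and one store per element. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

For the loop to stress main memory rather than the caches, the three arrays should be much larger than the last-level cache, and the loop should be run with and without sampling so the two run times can be compared.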

Do you want to trace every load/store instruction, or do you only want to record requests that missed all (or some) caches and were sent to the main RAM of the PC (to L3)?

Why do you want no overhead and all memory accesses recorded? It is simply impossible, because every traced memory access requires several bytes of trace data to be written to memory. So enabling memory tracing (with more than 10% of memory accesses traced) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticeable, but its effect (overhead) is smaller.
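A back-of-the-envelope estimate shows why the trace itself eats memory bandwidth. The helper below is a hypothetical sketch: it multiplies the number of sampled accesses per second by the size of one record. The 0xB0 (176-byte) record size is the one quoted in the powertools comment; the access rate is an assumed figure, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of trace data written per second when one record of record_bytes
 * is stored for every `period`-th memory access. */
uint64_t trace_bytes_per_second(uint64_t accesses_per_second,
                                uint64_t period, uint64_t record_bytes)
{
    assert(period >= 1);
    return (accesses_per_second / period) * record_bytes;
}
```

Assuming 1,000,000,000 memory accesses per second and 176-byte records, sampling every 100th access (1%) already writes 1,760,000,000 bytes of records per second, and sampling every 10th access (10%) writes ten times that, which is on the order of the memory bandwidth the program itself needs.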

Your CPU, the E5-2620 v4, is a 14 nm Broadwell-EP, so it may also have an earlier variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog post on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"

PT support in hardware: Broadwell (5th generation Core, Xeon v4). More overhead. No fine-grained timing.

PS: Researchers who study the memory accesses of SpecCPU have worked with memory access dumps/traces, and the dumps were generated slowly:

http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses were recorded for offline analysis; no timing was recorded from the tracing runs.
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores were instrumented to write into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation causes a 2x or greater slowdown, especially for cores limited by memory bandwidth or latency.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation: the program code was modified and instrumented to write memory access metadata into a buffer. Such instrumentation causes a 2x or greater slowdown, especially for cores limited by memory bandwidth or latency. The paper lists and explains the instrumentation overhead and caveats:

Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity

Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...

User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.
