Performance Counters for DRAM Accesses

Question

I want to retrieve the number of DRAM accesses in my application. Precisely, I need to distinguish between data and code accesses. The processor is an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). Based on Intel Software Developer's Manual, Volume 3 and Perf, I could find and categorize the following memory-access-related events:

(A)
LLC-load-misses                                    [Hardware cache event]
LLC-loads                                          [Hardware cache event]
LLC-store-misses                                   [Hardware cache event]
LLC-stores                                         [Hardware cache event]
=========================================================================
(B)
mem_load_uops_l3_miss_retired.local_dram          
mem_load_uops_retired.l3_miss  
=========================================================================
(C)
offcore_response.all_code_rd.l3_miss.any_response 
offcore_response.all_code_rd.l3_miss.local_dram   
offcore_response.all_data_rd.l3_miss.any_response 
offcore_response.all_data_rd.l3_miss.local_dram   
offcore_response.all_reads.l3_miss.any_response   
offcore_response.all_reads.l3_miss.local_dram     
offcore_response.all_requests.l3_miss.any_response
=========================================================================
(D)
offcore_response.all_rfo.l3_miss.any_response     
offcore_response.all_rfo.l3_miss.local_dram       
=========================================================================
(E)
offcore_response.demand_code_rd.l3_miss.any_response
offcore_response.demand_code_rd.l3_miss.local_dram
offcore_response.demand_data_rd.l3_miss.any_response
offcore_response.demand_data_rd.l3_miss.local_dram
offcore_response.demand_rfo.l3_miss.any_response  
offcore_response.demand_rfo.l3_miss.local_dram    
=========================================================================
(F)
offcore_response.pf_l2_code_rd.l3_miss.any_response
offcore_response.pf_l2_data_rd.l3_miss.any_response
offcore_response.pf_l2_rfo.l3_miss.any_response   
offcore_response.pf_l3_code_rd.l3_miss.any_response
offcore_response.pf_l3_data_rd.l3_miss.any_response
offcore_response.pf_l3_rfo.l3_miss.any_response

My selections are as follows:

  • It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).
  • For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).
  • Simplistically, LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code (As code is read-only).

Are these selections reasonable?

My other questions (the second one is the most important):

  • What are local_dram and any_response?
  • At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, there were twice as many offcore_response.all_reads.l3_miss.any_response events as LLC-load-misses.
  • Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?

Group (D), includes DRAM access events caused by Read for Ownership operations (for Cache Coherency Protocols). It seems irrelevant to my problem.

Group (F), counts DRAM reads caused by L2-cache prefetcher which is also irrelevant to my problem.

Recommended answer

Based on my understanding of the question, I recommend using the following two events on the specified processor:

  • OFFCORE_RESPONSE.ALL_READS.L3_MISS.LOCAL_DRAM: This includes all cacheable data read and write transactions and all code fetch transactions, whether the transaction is initiated by an instruction (retired or not) or by a prefetch of any type. Each event represents exactly one 64-byte read request to the memory controller.
  • OFFCORE_RESPONSE.ALL_CODE_RD.L3_MISS.LOCAL_DRAM: This includes all the code fetch accesses to the IMC.

(I think neither of these events occurs for uncacheable code fetch requests, but I've not tested this and the documentation is not clear on it.)

The "data accesses" can be measured separately from the "code accesses" by subtracting the second event from the first. These two events can be counted simultaneously on the same logical core on Haswell without multiplexing.
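The subtraction described above can be sketched in Python. This is a minimal illustration, not a definitive tool: the helper names (`perf_command`, `parse_counts`, `split_dram_reads`, `measure`) are hypothetical, the lowercase event names are the perf spellings listed in group (C) of the question, and the parser assumes the `perf stat -x,` CSV layout in which the count is the first field and the event name the third. Note that the "data" figure includes RFO (store-miss) reads, since the first event counts cacheable data read and write transactions.

```python
import subprocess

# perf spellings of the two recommended events (taken from group (C) in the question)
ALL_READS = "offcore_response.all_reads.l3_miss.local_dram"
CODE_READS = "offcore_response.all_code_rd.l3_miss.local_dram"
LINE_BYTES = 64  # each event is one 64-byte read request to the memory controller


def perf_command(workload):
    """Build a `perf stat` command line that counts both events in CSV mode."""
    return ["perf", "stat", "-x,", "-e", f"{ALL_READS},{CODE_READS}", "--"] + list(workload)


def parse_counts(csv_text):
    """Return {event: count}, assuming count is CSV field 0 and event name is field 2."""
    counts = {}
    for line in csv_text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank lines and entries such as "<not counted>"
        counts[fields[2]] = int(fields[0])
    return counts


def split_dram_reads(counts):
    """Separate code fetches from data reads by subtracting the second event from the first."""
    all_reads = counts[ALL_READS]
    code_reads = counts[CODE_READS]
    return {
        "code_reads": code_reads,
        "data_reads": all_reads - code_reads,
        "total_read_bytes": all_reads * LINE_BYTES,
    }


def measure(workload):
    """Run the workload under perf (which prints its counters to stderr) and split the counts."""
    result = subprocess.run(perf_command(workload), capture_output=True, text=True)
    return split_dram_reads(parse_counts(result.stderr))
```

Calling `measure(["./my_app"])` (with `./my_app` standing in for the application under test) would return the three derived figures. Because only two events are requested, they fit in the available counters at once, which is what makes the subtraction meaningful for a single run.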

There are of course other transactions that do go to the IMC but are not counted by either of the two mentioned events. These include: (1) L3 writebacks, (2) uncacheable partial reads and writes from cores, (3) full WCB evictions, and (4) memory accesses from IO devices. Depending on the workload, accesses of types (1), (3), and (4) may well constitute a significant fraction of total accesses to the IMC.

It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).

Note the following:

  • The event LLC-load-misses is a perf event mapped to the native event OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE.
  • The event LLC-store-misses is mapped to OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.ANY_RESPONSE.

These are not the events you want because:

  • The ANY_RESPONSE bit indicates that the event can occur for requests that target any unit, not just the IMC.
  • These events count L1 data prefetches and page walk requests, but not L2 data prefetches. You'd generally want to count all prefetches that consume memory bandwidth.

For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).

There are a number of issues with using mem_load_uops_retired.l3_miss on Haswell:

  • There are cases where this event is unreliable, so it should be avoided if there are alternatives. Otherwise, the analysis methodology should take into account the potential unreliability of this event count.
  • The event only occurs for requests from retired loads; it omits speculative loads and all stores, which can be significant.
  • Doing arithmetic with this event and other events in a meaningful way is not easy. For example, your suggestion of doing "LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code" is incorrect.

What are local_dram and any_response?

Not all requests that miss in the L3 go to the IMC. A typical example is memory-mapped IO requests. You said you only want the core-originated requests that go to the IMC, so local_dram is the right bit.

At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, there were twice as many offcore_response.all_reads.l3_miss.any_response events as LLC-load-misses.

This is normal because offcore_response.all_reads.l3_miss.any_response is inclusive of LLC-load-misses and can easily be significantly larger.

Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?

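No, because:

  • of the any_response bit, as explained above, and
  • this subtraction yields only the L2 prefetches for data loads, not all hardware and software prefetches for data loads.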
