Why do Perf and PAPI give different values for L3 cache references and misses?


Question


I am working on a project where we have to implement an algorithm that is proven in theory to be cache-friendly. In simple terms, if N is the input size and B is the number of elements that get transferred between the cache and RAM every time we have a cache miss, the algorithm will require O(N/B) accesses to RAM.

I would like to show that this is indeed the behavior in practice. To better understand how one can measure various cache-related hardware counters, I decided to use different tools. One is Perf and the other is the PAPI library. Unfortunately, the more I work with these tools, the less I understand what they actually do.

I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB. The cache line size is 64 bytes. I guess that must be the size of the block B.
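(For reference, these numbers can also be double-checked at run time. The sketch below uses glibc's sysconf extensions; the _SC_LEVEL* names are an assumption about the platform and may report 0 elsewhere.)

#include <iostream>
#include <unistd.h>   // sysconf; the _SC_LEVEL* names are glibc extensions

int main(){
    // Values are reported in bytes; 0 means the value could not be determined.
    std::cout << "L1d line size: " << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << std::endl;
    std::cout << "L3 cache size: " << sysconf(_SC_LEVEL3_CACHE_SIZE) << std::endl;
    return 0;
}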

Let's look at the following example:

#include <iostream>

using namespace std;

struct node{
    int l, r;
};

int main(int argc, char* argv[]){

    int n = 1000000;

    node* A = new node[n];

    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    return 0;
}

Each node requires 8 bytes, which means that a cache line can fit 8 nodes, so I should be expecting approximately 1000000/8 = 125000 L3 cache misses.

Without optimization (no -O3), this is the output from perf:

 perf stat -B -e cache-references,cache-misses ./cachetests 

 Performance counter stats for './cachetests':

       162,813      cache-references                                            
       142,247      cache-misses              #   87.368 % of all cache refs    

   0.007163021 seconds time elapsed

It is pretty close to what we are expecting. Now suppose that we use the PAPI library.

#include <iostream>
#include <papi.h>

using namespace std;

struct node{
    int l, r;
};

void handle_error(int err){
    std::cerr << "PAPI error: " << err << std::endl;
}

int main(int argc, char* argv[]){

    int numEvents = 2;
    long long values[2];
    int events[2] = {PAPI_L3_TCA,PAPI_L3_TCM};

    if (PAPI_start_counters(events, numEvents) != PAPI_OK)
        handle_error(1);

    int n = 1000000;
    node* A = new node[n];
    int i;
    for(i=0;i<n;i++){
        A[i].l = 1;
        A[i].r = 4;
    }

    if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
        handle_error(1);

    cout<<"L3 accesses: "<<values[0]<<endl;
    cout<<"L3 misses: "<<values[1]<<endl;
    cout<<"L3 miss/access ratio: "<<(double)values[1]/values[0]<<endl;

    return 0;
}

This is the output that I get:

L3 accesses: 3335
L3 misses: 848
L3 miss/access ratio: 0.254273

Why such a big difference between the two tools?

Solution

You can go through the source files of both perf and PAPI to find out which performance counter they actually map these events to, but it turns out they are the same (assuming an Intel Core i processor here): event 2E with umask 4F for references and umask 41 for misses. In the Intel 64 and IA-32 Architectures Developer's Manual these events are described as:

2EH 4FH LONGEST_LAT_CACHE.REFERENCE This event counts requests originating from the core that reference a cache line in the last level cache.

2EH 41H LONGEST_LAT_CACHE.MISS This event counts each cache miss condition for references to the last level cache.

That seems to be ok. So the problem is somewhere else.

Here are my reproduced numbers, except that I increased the array length by a factor of 100 (I noticed large fluctuations in the timing results otherwise, and with a length of 1,000,000 the array still almost fits into your L3 cache). main1 here is your first code example without PAPI and main2 your second one with PAPI.

$ perf stat -e cache-references,cache-misses ./main1 

 Performance counter stats for './main1':

        27.148.932      cache-references                                            
        22.233.713      cache-misses              #   81,895 % of all cache refs 

       0,885166681 seconds time elapsed

$ ./main2 
L3 accesses: 7084911
L3 misses: 2750883
L3 miss/access ratio: 0.388273

These obviously don't match. Let's see where we actually count the LLC references. Here are the first few lines of perf report after perf record -e cache-references ./main1:

  31,22%  main1    [kernel]          [k] 0xffffffff813fdd87
  16,79%  main1    main1             [.] main
   6,22%  main1    [kernel]          [k] 0xffffffff8182dd24
   5,72%  main1    [kernel]          [k] 0xffffffff811b541d
   3,11%  main1    [kernel]          [k] 0xffffffff811947e9
   1,53%  main1    [kernel]          [k] 0xffffffff811b5454
   1,28%  main1    [kernel]          [k] 0xffffffff811b638a
   1,24%  main1    [kernel]          [k] 0xffffffff811b6381
   1,20%  main1    [kernel]          [k] 0xffffffff811b5417
   1,20%  main1    [kernel]          [k] 0xffffffff811947c9
   1,07%  main1    [kernel]          [k] 0xffffffff811947ab
   0,96%  main1    [kernel]          [k] 0xffffffff81194799
   0,87%  main1    [kernel]          [k] 0xffffffff811947dc

So what you can see here is that only 16.79% of the cache references actually happen in user space; the rest are due to the kernel.

And here lies the problem. Comparing this to the PAPI result is unfair, because PAPI by default only counts user-space events, whereas perf by default collects both user- and kernel-space events.

For perf we can easily restrict the collection to user space only:

$ perf stat -e cache-references:u,cache-misses:u ./main1 

 Performance counter stats for './main1':

         7.170.190      cache-references:u                                          
         2.764.248      cache-misses:u            #   38,552 % of all cache refs    

       0,658690600 seconds time elapsed

These seem to match pretty well.
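As an alternative to restricting perf, PAPI could in principle be widened to count kernel-space events too. The old high-level PAPI_start_counters interface does not expose this, but the low-level API allows setting the counting domain of an event set. What follows is only a rough sketch under assumptions not taken from the original answer: PAPI_DOM_ALL is used for the domain, error handling is minimal, and the kernel's perf_event_paranoid setting must permit counting kernel events.

#include <iostream>
#include <cstring>
#include <papi.h>

// Rough sketch: count L3 accesses/misses in user *and* kernel space via the
// PAPI low-level API by setting the event set's domain to PAPI_DOM_ALL.
int main(int argc, char* argv[]){

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    int evset = PAPI_NULL;
    if (PAPI_create_eventset(&evset) != PAPI_OK) return 1;

    // Bind the event set to the CPU component so its domain can be changed.
    if (PAPI_assign_eventset_component(evset, 0) != PAPI_OK) return 1;

    PAPI_option_t opt;
    std::memset(&opt, 0, sizeof(opt));
    opt.domain.eventset = evset;
    opt.domain.domain   = PAPI_DOM_ALL;
    if (PAPI_set_opt(PAPI_DOMAIN, &opt) != PAPI_OK) return 1;

    if (PAPI_add_event(evset, PAPI_L3_TCA) != PAPI_OK) return 1;
    if (PAPI_add_event(evset, PAPI_L3_TCM) != PAPI_OK) return 1;

    if (PAPI_start(evset) != PAPI_OK) return 1;

    // ... the workload to be measured goes here ...

    long long values[2];
    if (PAPI_stop(evset, values) != PAPI_OK) return 1;

    std::cout << "L3 accesses: " << values[0] << std::endl;
    std::cout << "L3 misses: "   << values[1] << std::endl;
    return 0;
}

Counted this way, the PAPI numbers should be comparable to the unrestricted perf numbers above.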

Edit:

Let's look a bit closer at what the kernel does, this time with debug symbols and counting cache misses instead of references:

  59,64%  main1    [kernel]       [k] clear_page_c_e
  23,25%  main1    main1          [.] main
   2,71%  main1    [kernel]       [k] compaction_alloc
   2,70%  main1    [kernel]       [k] pageblock_pfn_to_page
   2,38%  main1    [kernel]       [k] get_pfnblock_flags_mask
   1,57%  main1    [kernel]       [k] _raw_spin_lock
   1,23%  main1    [kernel]       [k] clear_huge_page
   1,00%  main1    [kernel]       [k] get_page_from_freelist
   0,89%  main1    [kernel]       [k] free_pages_prepare

As we can see, most cache misses actually happen in clear_page_c_e, which is called when our program accesses a new page. As explained in the comments, new pages are zeroed by the kernel before access is allowed, so the cache misses already happen there.

This interferes with your analysis, because a good part of the cache misses you expect actually happen in kernel space. However, you cannot guarantee under which exact circumstances the kernel actually accesses memory, so there may be deviations from the behavior your code expects.

To avoid this, build an additional loop around your array-filling one. Only the first iteration of the inner loop then incurs the kernel overhead; as soon as every page in the array has been accessed, there should be no contribution left.
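A minimal sketch of this change to main1, keeping the enlarged 100,000,000-element array (the 100 outer repetitions match the measurement below):

#include <iostream>

struct node{
    int l, r;
};

int main(int argc, char* argv[]){

    int n = 100000000;   // enlarged array, as in the measurements above
    int reps = 100;      // outer repetitions

    node* A = new node[n];

    // The first pass touches every page and therefore pays the kernel's
    // page-clearing cost; later passes should only see ordinary misses.
    int r, i;
    for(r=0;r<reps;r++){
        for(i=0;i<n;i++){
            A[i].l = 1;
            A[i].r = 4;
        }
    }

    delete[] A;
    return 0;
}

Here is my result for 100 repetitions of the outer loop: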

$ perf stat -e cache-references:u,cache-references:k,cache-misses:u,cache-misses:k ./main1

 Performance counter stats for './main1':

     1.327.599.357      cache-references:u                                          
        23.678.135      cache-references:k                                          
     1.242.836.730      cache-misses:u            #   93,615 % of all cache refs    
        22.572.764      cache-misses:k            #   95,332 % of all cache refs    

      38,286354681 seconds time elapsed

The array length was 100,000,000 with 100 iterations, so by your analysis you would have expected 1,250,000,000 cache misses. That is pretty close now. The deviation comes mostly from the first pass, during which the array is loaded into the cache by the kernel while it clears the pages.

With PAPI, a few extra warm-up passes over the array can be inserted before the counters start, so that the kernel's page-clearing work is done before measurement begins.
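A rough sketch of how main2 might be arranged for this; the choice of 10 warm-up passes is arbitrary and only for illustration:

#include <iostream>
#include <papi.h>

using namespace std;

struct node{
    int l, r;
};

int main(int argc, char* argv[]){

    int numEvents = 2;
    long long values[2];
    int events[2] = {PAPI_L3_TCA,PAPI_L3_TCM};

    int n = 100000000;
    node* A = new node[n];

    // Warm-up passes (count chosen arbitrarily): touch every page before the
    // counters start, so the kernel's page-zeroing misses are not measured.
    int w, r, i;
    for(w=0;w<10;w++){
        for(i=0;i<n;i++){
            A[i].l = 0;
            A[i].r = 0;
        }
    }

    if (PAPI_start_counters(events, numEvents) != PAPI_OK)
        return 1;

    for(r=0;r<100;r++){
        for(i=0;i<n;i++){
            A[i].l = 1;
            A[i].r = 4;
        }
    }

    if (PAPI_stop_counters(values, numEvents) != PAPI_OK)
        return 1;

    cout<<"L3 accesses: "<<values[0]<<endl;
    cout<<"L3 misses: "<<values[1]<<endl;
    cout<<"L3 miss/access ratio: "<<(double)values[1]/values[0]<<endl;

    delete[] A;
    return 0;
}

With the warm-up passes in place, the result fits the expectation even better: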

$ ./main2 
L3 accesses: 1318699729
L3 misses: 1250684880
L3 miss/access ratio: 0.948423
