CPU measures (Cache misses/hits) which do not make sense


Problem description

I use Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure the cache efficiency.

Basically, I first put a small array into the L1 cache (by traversing it many times), then I fire up the timer, go over the array one more time (which hopefully uses the cache), and then turn off the timer.

PCM shows me that I have a rather high L2 and L3 miss ratio. I also checked with rdtscp and the cycles per array operation is 15 (which is much higher than 4-5 cycles for accessing L1 cache).
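
For reference, a minimal sketch of how such an rdtscp-based measurement might look (this is my own illustration, not the original poster's timing code; it assumes GCC/Clang on x86 and the __rdtscp intrinsic from <x86intrin.h>):

#include <x86intrin.h>   // __rdtscp intrinsic (GCC/Clang, x86)
#include <cstdint>
#include <cstdio>

// Read the timestamp counter; __rdtscp also waits for prior loads to complete.
static inline uint64_t read_tsc() {
    unsigned int aux;
    return __rdtscp(&aux);
}

int main() {
    const int N = 16;
    volatile long array[N] = {0};  // volatile so the loads are not optimized away

    // Warm the cache first.
    long sum = 0;
    for (int r = 0; r < 1000; r++)
        for (int i = 0; i < N; i++)
            sum += array[i];

    uint64_t start = read_tsc();
    for (int i = 0; i < N; i++)
        sum += array[i];
    uint64_t end = read_tsc();

    // For an L1-resident array one would hope for roughly 4-5 cycles per access,
    // rather than the ~15 observed in the question.
    std::printf("cycles per element: %llu (sum=%ld)\n",
                (unsigned long long)((end - start) / N), sum);
    return 0;
}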

What I would expect is that the array is placed entirely in the L1 cache, and that I wouldn't see high L1, L2 and L3 miss ratios.

My system has 32K, 256K and 25M for L1, L2 and L3 respectively. Here's my code:

#include <iostream>
#include "cpucounters.h"   // Intel PCM header (PCM, SystemCounterState, getIPC, ...)

using std::cout;
using std::endl;

static const int ARRAY_SIZE = 16;

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM * m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success){
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++){
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;

    SystemCounterState before = getSystemCounterState();

    while ((current = current->next) != NULL) {
        sum += current->pad;
    }

    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses:     " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits:       " << getL2CacheHits(before, after) << endl; 
    cout << "L2 hit ratio:  " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses:     " << getL3CacheMisses(before_sstate,after_sstate) << endl;
    cout << "L3 Hits:       " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio:  " << getL3CacheHitRatio(before, after) << endl;

    cout << "Sum:   " << sum << endl;
    m->cleanup();
    return 0;
}

This is the output:

Instructions per clock: 0.408456
Cycles per op:        553074
L2 Cache Misses:      58775
L2 Cache Hits:        11371
L2 cache hit ratio:   0.162105
L3 Cache Misses:      24164
L3 Cache Hits:        34611
L3 cache hit ratio:   0.588873


EDIT: I also checked the following code, and still get the same miss ratios (where I would have expected almost zero misses):

SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();


EDIT 2: As one commenter suggested, these results might be due to the overhead of the profiler itself. So instead of traversing the array only once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler's overhead. I still get very low L2 and L3 cache hit ratios (around 15%).
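
A rough sketch of that change, replacing the "Sequential access experiment" section of the listing above (the exact repeat count and loop structure are my guesses, since the updated code isn't shown):

    // Repeat the traversal inside the measured region so the cost of
    // reading the counters is amortized over many iterations.
    SystemCounterState before = getSystemCounterState();

    long sum = 0;
    for (long rep = 0; rep < 200000000L; rep++) {
        for (MyStruct *p = &myS[0]; p != NULL; p = p->next)
            sum += p->pad;
    }

    SystemCounterState after = getSystemCounterState();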

Solution

It seems that you are getting L2 and L3 misses from all cores on your system.

I looked at the PCM implementation here: https://github.com/erikarn/intel-pcm/blob/ecc0cf608dfd9366f4d2d9fa48dc821af1c26f33/src/cpucounters.cpp

[1] In the implementation of PCM::program() on line 1407, I don't see any code that limits the events to a specific process.

[2] In the implementation of PCM::getSystemCounterState() on line 2809, you can see that the events are gathered from all cores on your system. So I would try to set the CPU affinity of the process to one core and then read events only from that core, using the function CoreCounterState getCoreCounterState(uint32 core).
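
A minimal sketch of that suggestion, assuming Linux (sched_setaffinity) and that the PCM metric helpers accept a CoreCounterState the same way they accept a SystemCounterState:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE        // for sched_setaffinity / CPU_SET on glibc
#endif
#include <sched.h>
#include <iostream>
#include "cpucounters.h"   // Intel PCM

int main() {
    // Pin the whole process to core 0 so the measured work stays on one core.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::cerr << "sched_setaffinity failed" << std::endl;
        return 1;
    }

    PCM *m = PCM::getInstance();
    if (m->program(PCM::DEFAULT_EVENTS, NULL) != PCM::Success) {
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        return 1;
    }

    // Read counters for core 0 only, instead of the system-wide state.
    CoreCounterState before = getCoreCounterState(0);
    // ... run the array traversal here ...
    CoreCounterState after = getCoreCounterState(0);

    std::cout << "L2 hit ratio (core 0): " << getL2CacheHitRatio(before, after) << std::endl;
    std::cout << "L3 hit ratio (core 0): " << getL3CacheHitRatio(before, after) << std::endl;

    m->cleanup();
    return 0;
}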
