CPU measures (cache misses/hits) which do not make sense
Question
I use Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure the cache efficiency.
Basically, I first put a small array into the L1 cache (by traversing it many times), then I fire up the timer, go over the array one more time (which hopefully uses the cache), and then turn off the timer.
PCM shows me that I have a rather high L2 and L3 miss ratio. I also checked with rdtscp, and the cycles per array operation came out to 15 (which is much higher than the 4-5 cycles typical for an L1 cache access).

What I would expect is that the array is placed entirely in the L1 cache, so I would not see high L1, L2 and L3 miss ratios.
My system has 32 KB, 256 KB and 25 MB of L1, L2 and L3 cache, respectively. Here's my code:
static const int ARRAY_SIZE = 16;

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM *m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success) {
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i = 0; i < ARRAY_SIZE - 1; i++) {
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++) {
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;

    SystemCounterState before = getSystemCounterState();
    while ((current = current->next) != NULL) {
        sum += current->pad;
    }
    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses: " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits: " << getL2CacheHits(before, after) << endl;
    cout << "L2 hit ratio: " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses: " << getL3CacheMisses(before, after) << endl;
    cout << "L3 Hits: " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio: " << getL3CacheHitRatio(before, after) << endl;
    cout << "Sum: " << sum << endl;

    m->cleanup();
    return 0;
}
This is the output:
Instructions per clock: 0.408456
Cycles per op: 553074
L2 Cache Misses: 58775
L2 Cache Hits: 11371
L2 cache hit ratio: 0.162105
L3 Cache Misses: 24164
L3 Cache Hits: 34611
L3 cache hit ratio: 0.588873
EDIT: I also checked the following code, and still get the same miss ratios (where I would have expected almost zero miss ratios, since nothing happens between the two reads):
SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();
EDIT 2: As one commenter suggested, these results might be due to the overhead of the profiler itself. So instead of traversing the array only once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler's overhead. I still get very low L2 and L3 cache hit ratios (15%).
Solution

It seems that you get L2 and L3 misses from all cores on your system.
I looked at the PCM implementation here: https://github.com/erikarn/intel-pcm/blob/ecc0cf608dfd9366f4d2d9fa48dc821af1c26f33/src/cpucounters.cpp
[1] In the implementation of PCM::program() on line 1407, I don't see any code that limits events to a specific process.

[2] In the implementation of PCM::getSystemCounterState() on line 2809, you can see that the events are gathered from all cores on your system.

So I would try to set the CPU affinity of the process to one core, and then only read events from that core, using this function: CoreCounterState getCoreCounterState(uint32 core)