什么是L1缓存未命中的成本? [英] What is the Cost of an L1 Cache Miss?
问题描述
修改:作为参考(如果任何人碰到这个问题绊),伊戈尔·奥斯特洛夫斯基写了的伟大的职位了解高速缓存未命中。它讨论了几种不同的问题,显示了示例的数字。 结束修改
我做了一些测试<很长的故事到这里>
,我想知道,如果性能差异是由于内存高速缓存未命中。下面code演示了问题,归结下来的关键时序部分。下面code有一对夫妇,在随机访问内存循环,然后按升序地址顺序。
我跑了XP的机器(与VS2005编译:CL / O2)上(GCC -Os),并在Linux机器上。双方产生了相似倍。这些时间是毫秒。我相信所有的回路运行,并且不优化掉了(否则会运行立即)。
***测试20000节点
总有序时间:888.822899
总随机时间:2155.846268
这些数字有意义吗?区别主要是由于L1高速缓存未命中或别的东西怎么回事呢?有20,000平方公尺的内存访问,如果每个人都缓存未命中,即每小姐约3.2纳秒。我测试的XP(P4)的机器是3.2GHz的,我怀疑(但不知道)有一个32KB的L1缓存和512KB L2。 20000项(80KB),我认为没有L2未命中的数量显著。因此,这将是(3.2 * 10 ^ 9次/秒)* 3.2 * 10 ^ -9秒/小姐)= 10.1次/小姐
。这似乎是高了我。也许不是,也许我的数学不好。我试图测量高速缓存未命中与VTune™可视化,但我得到了BSOD。现在我无法得到它连接到许可服务器(grrrr)。
typedef结构stItem
{
长LDATA;
//字符acPad [20];
} LIST_NODE;#如果定义(WIN32)
无效startTimer所(LONGLONG * PT1)
{
QueryPerformanceCounter的((LARGE_INTEGER *)PT1);
}无效StopTimer(LONGLONG T1,双* PDMS)
{
LONGLONG T2,llFreq; QueryPerformanceCounter的((LARGE_INTEGER *)及T2);
QueryPerformanceFrequency的((LARGE_INTEGER *)及llFreq);
* PDMS =((双)(T2 - T1)/(双)llFreq)* 1000.0;
}
#其他
//不需要在这种情况下,64位整数
无效startTimer所(LONGLONG * PT1)
{
//只需使用时钟(),这个测试并不需要更高的分辨率
* PT1 =时钟();
}无效StopTimer(LONGLONG T1,双* PDMS)
{
LONGLONG T2 =时钟();
* PDMS =(双)(T2 - T1)/(CLOCKS_PER_SEC / 1000);
}
#万一长longrand()
{
#如果定义(WIN32)
//愚蠢俗气的方式,以确保它不只是一个16位的值兰特
回报(RAND()<< 16)| RAND();
#其他
返回兰特();
#万一
}//在给定范围内获得随机值
INT randint(INT男,INT N)
{
INT RET = longrand()%(N - M + 1);
返回RET +米;
}//我觉得我得到了这一点编程珍珠(宾利)。
无效ShuffleArray
(
长* plShuffle,//(O)的随机返回数组排序整数
阵列的长lNumItems //(I)长
)
{
我长;
龙J;
长吨; 对于(i = 0; I< lNumItems;我++)
plShuffle [我] =我; 对于(i = 0; I< lNumItems;我++)
{
J = randint(ⅰ,lNumItems - 1); T = plShuffle [I]
plShuffle [I] = plShuffle [J]。
plShuffle [J] = T;
}
}INT主(INT ARGC,CHAR *的argv [])
{
长* plDataValues;
LIST_NODE * pstNodes;
长lNumItems = 20000;
长I,J;
LONGLONG T1; //计时
双DMS; 如果(argc个大于1&放大器;&放大器;与atoi(argv的[1])大于0)
lNumItems =的atoi(ARGV [1]); 的printf(\\ n \\ n ***测试%u个节点\\ n,lNumItems); 函数srand((unsigned int类型)时间(0)); //分配节点作为一个内存块单
pstNodes =(LIST_NODE *)malloc的(lNumItems *的sizeof(LIST_NODE));
断言(pstNodes!= NULL); //创建一个数组,给出了节点的访问顺序
plDataValues =(长*)malloc的(lNumItems *的sizeof(长));
断言(plDataValues!= NULL); //访问,以便数据
对于(i = 0; I< lNumItems;我++)
plDataValues [我] =我; startTimer所(安培T1); //循环和访问内存一堆倍
为(J = 0; J< lNumItems; J ++)
{
对于(i = 0; I< lNumItems;我++)
{
。pstNodes [plDataValues [I] LDATA = I *焦耳;
}
} StopTimer(T1,&放大器; DMS);
的printf(总序时间:%F \\ N,DMS); //现在访问随机的顺序排列位置
ShuffleArray(plDataValues,lNumItems); startTimer所(安培T1); 为(J = 0; J< lNumItems; J ++)
{
对于(i = 0; I< lNumItems;我++)
{
。pstNodes [plDataValues [I] LDATA = I *焦耳;
}
} StopTimer(T1,&放大器; DMS);
的printf(总随机时间:%F \\ N,DMS);}
虽然我不能提供一个答案的数字是否没有意义了(我不能很好地在缓存延迟精通,但备案〜 10周期L1高速缓存未命中听起来是正确的),我可以为您提供 Cachegrind 作为一种工具来帮助你实际看到你的2次测试之间的缓存性能的差异。
Cachegrind是Valgrind的工具型材高速缓存和分支命中/未命中(的框架,权力永远可爱MEMCHECK)。它会给你的缓存多少命中/惦记着你实际上是让你的程序中。一个想法
Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit
I did some testing <long story goes here>
and am wondering if a performance difference is due to memory cache misses. The following code demonstrates the issue and boils it down to the critical timing portion. The following code has a couple of loops that visit memory in random order and then in ascending address order.
I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc –Os). Both produced similar times. These times are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run "instantly").
*** Testing 20000 nodes Total Ordered Time: 888.822899 Total Random Time: 2155.846268
Do these numbers make sense? Is the difference primarily due to L1 cache misses or is something else going on as well? There are 20,000^2 memory accesses and if every one were a cache miss, that is about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don’t know) has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * 3.2*10^-9 seconds/miss) = 10.1 cycles/miss
. That seems high to me. Maybe it’s not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can’t get it to connect to the license server (grrrr).
typedef struct stItem
{
long lData;
//char acPad[20];
} LIST_NODE;
#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}
void StopTimer( LONGLONG t1, double *pdMS )
{
LONGLONG t2, llFreq;
QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
*pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
// Just use clock(), this test doesn't need higher resolution
*pt1 = clock();
}
void StopTimer( LONGLONG t1, double *pdMS )
{
LONGLONG t2 = clock();
*pdMS = (double)( t2 - t1 ) / ( CLOCKS_PER_SEC / 1000 );
}
#endif
long longrand()
{
#if defined( WIN32 )
// Stupid cheesy way to make sure it is not just a 16-bit rand value
return ( rand() << 16 ) | rand();
#else
return rand();
#endif
}
// get random value in the given range
int randint( int m, int n )
{
int ret = longrand() % ( n - m + 1 );
return ret + m;
}
// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
long *plShuffle, // (O) return array of "randomly" ordered integers
long lNumItems // (I) length of array
)
{
long i;
long j;
long t;
for ( i = 0; i < lNumItems; i++ )
plShuffle[i] = i;
for ( i = 0; i < lNumItems; i++ )
{
j = randint( i, lNumItems - 1 );
t = plShuffle[i];
plShuffle[i] = plShuffle[j];
plShuffle[j] = t;
}
}
int main( int argc, char* argv[] )
{
long *plDataValues;
LIST_NODE *pstNodes;
long lNumItems = 20000;
long i, j;
LONGLONG t1; // for timing
double dms;
if ( argc > 1 && atoi(argv[1]) > 0 )
lNumItems = atoi( argv[1] );
printf( "\n\n*** Testing %u nodes\n", lNumItems );
srand( (unsigned int)time( 0 ));
// allocate the nodes as one single chunk of memory
pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
assert( pstNodes != NULL );
// Create an array that gives the access order for the nodes
plDataValues = (long*)malloc( lNumItems * sizeof( long ));
assert( plDataValues != NULL );
// Access the data in order
for ( i = 0; i < lNumItems; i++ )
plDataValues[i] = i;
StartTimer( &t1 );
// Loop through and access the memory a bunch of times
for ( j = 0; j < lNumItems; j++ )
{
for ( i = 0; i < lNumItems; i++ )
{
pstNodes[plDataValues[i]].lData = i * j;
}
}
StopTimer( t1, &dms );
printf( "Total Ordered Time: %f\n", dms );
// now access the array positions in a "random" order
ShuffleArray( plDataValues, lNumItems );
StartTimer( &t1 );
for ( j = 0; j < lNumItems; j++ )
{
for ( i = 0; i < lNumItems; i++ )
{
pstNodes[plDataValues[i]].lData = i * j;
}
}
StopTimer( t1, &dms );
printf( "Total Random Time: %f\n", dms );
}
While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in the cache latencies, but for the record ~10 cycle L1 cache misses sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your 2 tests.
Cachegrind is a Valgrind tool (the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.
这篇关于什么是L1缓存未命中的成本?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!