什么是L1缓存未命中的成本？ [英] What is the Cost of an L1 Cache Miss?

查看：234 发布时间：2016/8/19 15:33:27 c caching memory-access

本文介绍了什么是L1缓存未命中的成本？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

修改：作为参考（如果任何人碰到这个问题绊），伊戈尔·奥斯特洛夫斯基写了的伟大的职位了解高速缓存未命中。它讨论了几种不同的问题，显示了示例的数字。 结束修改

我做了一些测试＆LT;很长的故事到这里＆GT; ，我想知道，如果性能差异是由于内存高速缓存未命中。下面code演示了问题，归结下来的关键时序部分。下面code有一对夫妇，在随机访问内存循环，然后按升序地址顺序。

我跑了XP的机器（与VS2005编译：CL / O2）上（GCC -Os），并在Linux机器上。双方产生了相似倍。这些时间是毫秒。我相信所有的回路运行，并且不优化掉了（否则会运行立即）。


***测试20000节点
总有序时间：888.822899
总随机时间：2155.846268

这些数字有意义吗？区别主要是由于L1高速缓存未命中或别的东西怎么回事呢？有20,000平方公尺的内存访问，如果每个人都缓存未命中，即每小姐约3.2纳秒。我测试的XP（P4）的机器是3.2GHz的，我怀疑（但不知道）有一个32KB的L1缓存和512KB L2。 20000项（80KB），我认为没有L2未命中的数量显著。因此，这将是（3.2 * 10 ^ 9次/秒）* 3.2 * 10 ^ -9秒/小姐）= 10.1次/小姐。这似乎是高了我。也许不是，也许我的数学不好。我试图测量高速缓存未命中与VTune™可视化，但我得到了BSOD。现在我无法得到它连接到许可服务器（grrrr）。

  typedef结构stItem
{
   长LDATA;
   //字符acPad [20];
} LIST_NODE;＃如果定义（WIN32）
无效startTimer所（LONGLONG * PT1）
{
   QueryPerformanceCounter的（（LARGE_INTEGER *）PT1）;
}无效StopTimer（LONGLONG T1，双* PDMS）
{
   LONGLONG T2，llFreq;   QueryPerformanceCounter的（（LARGE_INTEGER *）及T2）;
   QueryPerformanceFrequency的（（LARGE_INTEGER *）及llFreq）;
   * PDMS =（（双）（T2  -  T1）/（双）llFreq）* 1000.0;
}
＃其他
//不需要在这种情况下，64位整数
无效startTimer所（LONGLONG * PT1）
{
   //只需使用时钟（），这个测试并不需要更高的分辨率
   * PT1 =时钟（）;
}无效StopTimer（LONGLONG T1，双* PDMS）
{
   LONGLONG T2 =时钟（）;
   * PDMS =（双）（T2  -  T1）/（CLOCKS_PER_SEC / 1000）;
}
＃万一长longrand（）
{
   ＃如果定义（WIN32）
   //愚蠢俗气的方式，以确保它不只是一个16位的值兰特
   回报（RAND（）＆LT;＆LT; 16）| RAND（）;
   ＃其他
   返回兰特（）;
   ＃万一
}//在给定范围内获得随机值
INT randint（INT男，INT N）
{
   INT RET = longrand（）％（N  -  M + 1）;
   返回RET +米;
}//我觉得我得到了这一点编程珍珠（宾利）。
无效ShuffleArray
（
   长* plShuffle，//（O）的随机返回数组排序整数
   阵列的长lNumItems //（I）长
）
{
   我长;
   龙J;
   长吨;   对于（i = 0; I＆LT; lNumItems;我++）
      plShuffle [我] =我;   对于（i = 0; I＆LT; lNumItems;我++）
      {
      J = randint（ⅰ，lNumItems  -  1）;      T = plShuffle [I]
      plShuffle [I] = plShuffle [J]。
      plShuffle [J] = T;
      }
}INT主（INT ARGC，CHAR *的argv []）
{
   长* plDataValues;
   LIST_NODE * pstNodes;
   长lNumItems = 20000;
   长I，J;
   LONGLONG T1; //计时
   双DMS;   如果（argc个大于1＆放大器;＆放大器;与atoi（argv的[1]）大于0）
      lNumItems =的atoi（ARGV [1]）;   的printf（\\ n \\ n ***测试％u个节点\\ n，lNumItems）;   函数srand（（unsigned int类型）时间（0））;   //分配节点作为一个内存块单
   pstNodes =（LIST_NODE *）malloc的（lNumItems *的sizeof（LIST_NODE））;
   断言（pstNodes！= NULL）;   //创建一个数组，给出了节点的访问顺序
   plDataValues =（长*）malloc的（lNumItems *的sizeof（长））;
   断言（plDataValues！= NULL）;   //访问，以便数据
   对于（i = 0; I＆LT; lNumItems;我++）
      plDataValues [我] =我;   startTimer所（安培T1）;   //循环和访问内存一堆倍
   为（J = 0; J＆LT; lNumItems; J ++）
      {
      对于（i = 0; I＆LT; lNumItems;我++）
         {
         。pstNodes [plDataValues [I] LDATA = I *焦耳;
         }
      }   StopTimer（T1，＆放大器; DMS）;
   的printf（总序时间：％F \\ N，DMS）;   //现在访问随机的顺序排列位置
   ShuffleArray（plDataValues，lNumItems）;   startTimer所（安培T1）;   为（J = 0; J＆LT; lNumItems; J ++）
      {
      对于（i = 0; I＆LT; lNumItems;我++）
         {
         。pstNodes [plDataValues [I] LDATA = I *焦耳;
         }
      }   StopTimer（T1，＆放大器; DMS）;
   的printf（总随机时间：％F \\ N，DMS）;}

解决方案

虽然我不能提供一个答案的数字是否没有意义了（我不能很好地在缓存延迟精通，但备案〜 10周期L1高速缓存未命中听起来是正确的），我可以为您提供 Cachegrind 作为一种工具来帮助你实际看到你的2次测试之间的缓存性能的差异。

Cachegrind是Valgrind的工具型材高速缓存和分支命中/未命中（的框架，权力永远可爱MEMCHECK）。它会给你的缓存多少命中/惦记着你实际上是让你的程序中。一个想法

Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit

I did some testing <long story goes here> and am wondering if a performance difference is due to memory cache misses. The following code demonstrates the issue and boils it down to the critical timing portion. The following code has a couple of loops that visit memory in random order and then in ascending address order.

I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc –Os). Both produced similar times. These times are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run "instantly").

*** Testing 20000 nodes
Total Ordered Time: 888.822899
Total Random Time: 2155.846268

Do these numbers make sense? Is the difference primarily due to L1 cache misses or is something else going on as well? There are 20,000^2 memory accesses and if every one were a cache miss, that is about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don’t know) has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * 3.2*10^-9 seconds/miss) = 10.1 cycles/miss. That seems high to me. Maybe it’s not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can’t get it to connect to the license server (grrrr).

typedef struct stItem
{
   long     lData;
   //char     acPad[20];
} LIST_NODE;



#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{
   QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2, llFreq;

   QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
   QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
   *pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
   // Just use clock(), this test doesn't need higher resolution
   *pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2 = clock();
   *pdMS = (double)( t2 - t1 ) / ( CLOCKS_PER_SEC / 1000 );
}
#endif



long longrand()
{
   #if defined( WIN32 )
   // Stupid cheesy way to make sure it is not just a 16-bit rand value
   return ( rand() << 16 ) | rand();
   #else
   return rand();
   #endif
}

// get random value in the given range
int randint( int m, int n )
{
   int ret = longrand() % ( n - m + 1 );
   return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
   long *plShuffle,  // (O) return array of "randomly" ordered integers
   long lNumItems    // (I) length of array
)
{
   long i;
   long j;
   long t;

   for ( i = 0; i < lNumItems; i++ )
      plShuffle[i] = i;

   for ( i = 0; i < lNumItems; i++ )
      {
      j = randint( i, lNumItems - 1 );

      t = plShuffle[i];
      plShuffle[i] = plShuffle[j];
      plShuffle[j] = t;
      }
}



int main( int argc, char* argv[] )
{
   long          *plDataValues;
   LIST_NODE     *pstNodes;
   long          lNumItems = 20000;
   long          i, j;
   LONGLONG      t1;  // for timing
   double dms;

   if ( argc > 1 && atoi(argv[1]) > 0 )
      lNumItems = atoi( argv[1] );

   printf( "\n\n*** Testing %u nodes\n", lNumItems );

   srand( (unsigned int)time( 0 ));

   // allocate the nodes as one single chunk of memory
   pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
   assert( pstNodes != NULL );

   // Create an array that gives the access order for the nodes
   plDataValues = (long*)malloc( lNumItems * sizeof( long ));
   assert( plDataValues != NULL );

   // Access the data in order
   for ( i = 0; i < lNumItems; i++ )
      plDataValues[i] = i;

   StartTimer( &t1 );

   // Loop through and access the memory a bunch of times
   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Ordered Time: %f\n", dms );

   // now access the array positions in a "random" order
   ShuffleArray( plDataValues, lNumItems );

   StartTimer( &t1 );

   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Random Time: %f\n", dms );

}

解决方案

While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in the cache latencies, but for the record ~10 cycle L1 cache misses sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your 2 tests.

Cachegrind is a Valgrind tool (the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.

这篇关于什么是L1缓存未命中的成本？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

什么是L1缓存未命中的成本？ [英] What is the Cost of an L1 Cache Miss?

问题描述

相关文章

C/C++最新文章

热门教程

热门工具

登录关闭

什么是L1缓存未命中的成本？ [英] What is the Cost of an L1 Cache Miss?

问题描述

相关文章

C/C++最新文章

热门教程

热门工具

登录 关闭

登录关闭