C++ Cache performance odd behavior


Question

I read an article (about 1.5 years old: http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) which talks about cache performance and the size of data. They show the following code, which they say they ran on an i7 (Sandy Bridge):

static volatile int array[Size];
static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
          array[x]++;
}

They claim that if they keep Size*Iterations constant while increasing Size, they observe a huge (10x) spike in execution time once the array's size in memory grows beyond the L2 cache size.

As an exercise I wanted to try this myself to see if I could reproduce their results on my machine (i7 3770k, Win7, Visual C++ 2012 compiler, Win32 debug mode, no optimizations enabled). To my surprise, though, I am not able to see an increase in execution time (even beyond the L3 cache size), which made me think the compiler was somehow optimizing this code. But I don't see any optimizations either. The only change in speed I see is that below the word size of my machine it takes slightly longer. Below are my timings, code listing, and the pertinent disassembly.

Does anyone know:

1) Why does the time taken not increase at all, regardless of the size of my array? Or how could I find out why?

2) Why does the time taken start high and then decrease until the cache line size is reached? Shouldn't more iterations be processed without reading from cache if the data is smaller than the line size?


Timings:

Size=1,Iterations=1073741824, Time=3829
Size=2,Iterations=536870912, Time=2625
Size=4,Iterations=268435456, Time=2563
Size=16,Iterations=67108864, Time=2906
Size=32,Iterations=33554432, Time=3469
Size=64,Iterations=16777216, Time=3250
Size=256,Iterations=4194304, Time=3140
Size=1024,Iterations=1048576, Time=3110
Size=2048,Iterations=524288, Time=3187
Size=4096,Iterations=262144, Time=3078
Size=8192,Iterations=131072, Time=3125
Size=16384,Iterations=65536, Time=3109
Size=32768,Iterations=32768, Time=3078
Size=65536,Iterations=16384, Time=3078
Size=262144,Iterations=4096, Time=3172
Size=524288,Iterations=2048, Time=3109
Size=1048576,Iterations=1024, Time=3094
Size=2097152,Iterations=512, Time=3313
Size=4194304,Iterations=256, Time=3391
Size=8388608,Iterations=128, Time=3312
Size=33554432,Iterations=32, Time=3109
Size=134217728,Iterations=8, Time=3515
Size=536870912,Iterations=2, Time=3532


code:

#include <string>
#include <cassert>
#include <cstdio>   // for printf
#include <windows.h>

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_body(volatile char* array)
{
     for (unsigned int i = 0; i < ITERATIONS; i++)
    {
        for (unsigned int  x = 0; x < SIZE; x++)
        {
            array[x]++;
        }
    }
}

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
    assert(SIZE*ITERATIONS == 1024*1024*1024);
    static volatile char array[SIZE];

    test_body<SIZE, 1>(array); //warmup

    DWORD beginTime = GetTickCount();
    test_body<SIZE, ITERATIONS>(array); 
    DWORD endTime= GetTickCount();
    printf("Size=%u,Iterations=%u, Time=%d\n", SIZE,ITERATIONS, endTime-beginTime);
}

int main()
{
    enum { eIterations= 1024*1024*1024};
    test_function<1, eIterations>();
    test_function<2, eIterations/2>();
    test_function<4, eIterations/4>();
    test_function<16, eIterations/16>();
    test_function<32, eIterations/ 32>();
    test_function<64, eIterations/ 64>();
    test_function<256, eIterations/ 256>();
    test_function<1024, eIterations/ 1024>();
    test_function<2048, eIterations/ 2048>();
    test_function<4096, eIterations/ 4096>();
    test_function<8192, eIterations/ 8192>();
    test_function<16384, eIterations/ 16384>();
    test_function<32768, eIterations/ 32768>();
    test_function<65536, eIterations/ 65536>();
    test_function<262144, eIterations/ 262144>();
    test_function<524288, eIterations/ 524288>();
    test_function<1048576, eIterations/ 1048576>();
    test_function<2097152, eIterations/ 2097152>();
    test_function<4194304, eIterations/ 4194304>();
    test_function<8388608, eIterations/ 8388608>();
    test_function<33554432, eIterations/ 33554432>();
    test_function<134217728, eIterations/ 134217728>();
    test_function<536870912, eIterations/ 536870912>();
}


Disassembly

    for (unsigned int i = 0; i < ITERATIONS; i++)
00281A59  mov         dword ptr [ebp-4],0  
00281A60  jmp         test_body<536870912,2>+1Bh (0281A6Bh)  
00281A62  mov         eax,dword ptr [ebp-4]  
00281A65  add         eax,1  
00281A68  mov         dword ptr [ebp-4],eax  
00281A6B  cmp         dword ptr [ebp-4],2  
00281A6F  jae         test_body<536870912,2>+53h (0281AA3h)  
    {
        for (unsigned int  x = 0; x < SIZE; x++)
00281A71  mov         dword ptr [ebp-8],0  
00281A78  jmp         test_body<536870912,2>+33h (0281A83h)  
00281A7A  mov         eax,dword ptr [ebp-8]  
    {
        for (unsigned int  x = 0; x < SIZE; x++)
00281A7D  add         eax,1  
00281A80  mov         dword ptr [ebp-8],eax  
00281A83  cmp         dword ptr [ebp-8],20000000h  
00281A8A  jae         test_body<536870912,2>+51h (0281AA1h)  
        {
            array[x]++;
00281A8C  mov         eax,dword ptr [array]  
00281A8F  add         eax,dword ptr [ebp-8]  
00281A92  mov         cl,byte ptr [eax]  
00281A94  add         cl,1  
00281A97  mov         edx,dword ptr [array]  
00281A9A  add         edx,dword ptr [ebp-8]  
00281A9D  mov         byte ptr [edx],cl  
        }
00281A9F  jmp         test_body<536870912,2>+2Ah (0281A7Ah)  
    }
00281AA1  jmp         test_body<536870912,2>+12h (0281A62h)  

Solution

TL;DR: Your test is not a correct test of cache latency or speed. Instead, it measures the problems of pushing complex code through the out-of-order (OoO) CPU pipeline.

Use the right tests for measuring cache and memory latency: lat_mem_rd from lmbench. And use the right tests for speed (bandwidth) measurements: the STREAM benchmark for memory speed, and the tests from memtest86 for cache speed (they use rep movsl as the main operation).
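
For illustration only (this is not the actual STREAM code, and the array size, repetition count and kernel are assumptions): a minimal STREAM-style "triad" sketch streams over arrays several times larger than the last-level cache and reports bytes moved per second, which is the kind of bandwidth number such tools give you.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    // Arrays several times larger than the last-level cache, so the loop
    // streams from DRAM instead of reusing cached data (sizes are illustrative).
    const std::size_t N = 8 * 1024 * 1024;                // 8 M doubles = 64 MB per array
    std::vector<double> a(N, 1.0), b(N, 2.0), c(N, 0.0);
    const double scalar = 3.0;
    const int reps = 10;

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < N; ++i)
            c[i] = a[i] + scalar * b[i];                  // STREAM-style "triad" kernel
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double bytes   = 3.0 * sizeof(double) * N * reps;   // 2 loads + 1 store per element
    std::printf("triad: %.2f GB/s (c[0]=%f)\n", bytes / seconds / 1e9, c[0]);
}

Compile it with optimizations enabled; otherwise you measure loop overhead again rather than memory.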

Also, in modern (2010 and newer) desktop/server CPUs there is hardware prefetch logic built in near the L1 and L2 caches, which detects linear access patterns and preloads data from outer caches into inner ones before you ask for that data: Intel Optimization Manual - 7.2 Hardware prefetching of data, page 365; intel.com blog, 2009. It is hard to disable all hardware prefetchers (SO Q/A 1, SO Q/A 2).
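
lat_mem_rd-style latency tests work by chasing a chain of dependent loads laid out in random order: the random order is also the usual way to defeat the hardware prefetchers described above, and the data dependence stops the out-of-order core from overlapping the loads, so the time per hop approximates the load latency of whatever level the buffer fits in. A minimal sketch of that technique (buffer size, hop count and RNG seed are assumptions; this is not lmbench's actual code):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    // 32 MB buffer (several times the L3 size of a typical desktop part) and
    // one entry per 64-byte cache line, so every hop lands on a new line.
    const int line  = 64 / sizeof(int);
    const int lines = (32 * 1024 * 1024) / 64;
    std::vector<int> next(lines * line, 0);

    // Build a random cyclic permutation of the lines: next[i] holds the index
    // to visit after index i.
    std::vector<int> order(lines);
    for (int i = 0; i < lines; ++i) order[i] = i * line;
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (int i = 0; i + 1 < lines; ++i) next[order[i]] = order[i + 1];
    next[order[lines - 1]] = order[0];

    const long long hops = 50LL * 1000 * 1000;
    int p = order[0];
    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < hops; ++i)
        p = next[p];                                      // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    std::printf("average load latency: %.1f ns (p=%d)\n", ns, p);  // print p so the chase is not optimized away
}

Shrinking the buffer so that it fits in L1, L2 or L3 gives the latency of each level in turn; that is the kind of curve lat_mem_rd plots.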

Long story:

I will try to do several measurements of a similar test with the perf performance monitoring tool on Linux (aka perf_events). The code is based on the program from Joky (an array of 32-bit ints, not of chars) and was split into several binaries: a5 is for size 2^5 = 32; a10 => 2^10 = 1024 (4 KB); a15 => 2^15 = 32768; a20 => 2^20 (1 million ints = 4 MB); and a25 => 2^25 (32 million ints = 128 MB). The CPU is an i7-2600 quad-core Sandy Bridge at 3.4 GHz.
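
I don't have the exact source at hand, so as an assumed reconstruction (not the measured code): each aN binary presumably bakes SIZE in at compile time and picks ITERATIONS so that SIZE*ITERATIONS stays at 2^30 increments, roughly like this:

#include <chrono>
#include <cstdio>

// Assumed reconstruction, not the measured source: SIZE would be baked into
// each binary (a10 => 1 << 10, a20 => 1 << 20, ...) with ITERATIONS chosen so
// that SIZE * ITERATIONS stays at 2^30, matching the perf output below.
static const unsigned SIZE       = 1u << 20;              // a20: 1 M ints = 4 MB
static const unsigned ITERATIONS = (1u << 30) / SIZE;

static volatile int data[SIZE];

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < ITERATIONS; ++i)
        for (unsigned x = 0; x < SIZE; ++x)
            data[x]++;                                     // 1 load + 1 store per element
    auto t1 = std::chrono::steady_clock::now();

    std::printf("Size=%u ITERATIONS=%u, TIME=%.2f ms\n", SIZE, ITERATIONS,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}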

Let's start with a basic perf stat run using the default event set (some lines are skipped). I selected 2^10 (4 KB) and 2^20 (4 MB):

$ perf stat ./a10
Size=1024 ITERATIONS=1048576, TIME=2372.09 ms

 Performance counter stats for './a10':

               276 page-faults               #    0,000 M/sec
     8 238 473 169 cycles                    #    3,499 GHz
     4 936 244 310 stalled-cycles-frontend   #   59,92% frontend cycles idle
       415 849 629 stalled-cycles-backend    #    5,05% backend  cycles idle
    11 832 421 238 instructions              #    1,44  insns per cycle
                                             #    0,42  stalled cycles per insn
     1 078 974 782 branches                  #  458,274 M/sec
         1 080 091 branch-misses             #    0,10% of all branches

$ perf stat ./a20
Size=1048576 ITERATIONS=1024, TIME=2432.4 ms

 Performance counter stats for './a20':

             2 321 page-faults               #    0,001 M/sec
     8 487 656 735 cycles                    #    3,499 GHz
     5 184 295 720 stalled-cycles-frontend   #   61,08% frontend cycles idle
       663 245 253 stalled-cycles-backend    #    7,81% backend  cycles idle
    11 836 712 988 instructions              #    1,39  insns per cycle
                                             #    0,44  stalled cycles per insn
     1 077 257 745 branches                  #  444,104 M/sec
            30 601 branch-misses             #    0,00% of all branches

What can we see here? The instruction counts are very close (because Size*Iterations is constant), and the cycle counts and times are close too. Both examples have 1 billion branches with 99% good prediction. But there is a very high 60% stall count for the frontend and 5-8% for the backend. Frontend stalls are stalls in instruction fetch and decode; it can be hard to tell why, but for your code the frontend can't decode 4 instructions per tick (page B-41 of the Intel optimization manual, section B.3 - "Performance tuning techniques for ... Sandy Bridge", subsection B.3.2 Hierarchical Top-Down Performance Characterization ...)

$ perf record -e stalled-cycles-frontend ./a20
Size=1048576 ITERATIONS=1024, TIME=2477.65 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.097 MB perf.data (~4245 samples) ]
$ perf annotate -d a20|cat
 Percent |      Source code & Disassembly of a20
------------------------------------------------
         :      08048e6f <void test_body<1048576u, 1024u>(int volatile*)>:

   10.43 :       8048e87:       mov    -0x8(%ebp),%eax
    1.10 :       8048e8a:       lea    0x0(,%eax,4),%edx
    0.16 :       8048e91:       mov    0x8(%ebp),%eax
    0.78 :       8048e94:       add    %edx,%eax
    6.87 :       8048e96:       mov    (%eax),%edx
   52.53 :       8048e98:       add    $0x1,%edx
    9.89 :       8048e9b:       mov    %edx,(%eax)
   14.15 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    2.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    1.39 :       8048ea4:       cmp    $0xfffff,%eax

Or here with the raw opcodes (objdump -d): some instructions have rather complicated indexing, so possibly they can't be handled by the 3 simple decoders and have to wait for the only complex decoder (there is a diagram here: http://www.realworldtech.com/sandy-bridge/4/)

 8048e87:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048e8a:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
 8048e91:       8b 45 08                mov    0x8(%ebp),%eax
 8048e94:       01 d0                   add    %edx,%eax
 8048e96:       8b 10                   mov    (%eax),%edx
 8048e98:       83 c2 01                add    $0x1,%edx
 8048e9b:       89 10                   mov    %edx,(%eax)
 8048e9d:       83 45 f8 01             addl   $0x1,-0x8(%ebp)
 8048ea1:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048ea4:       3d ff ff 0f 00          cmp    $0xfffff,%eax

Backend stalls are stalls created by waiting for memory or cache (the thing you are interested in when measuring caches) and by internal execution core stalls:

$ perf record -e stalled-cycles-backend ./a20
Size=1048576 ITERATIONS=1024, TIME=2480.09 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB perf.data (~4149 samples) ]
$ perf annotate -d a20|cat
    4.25 :       8048e96:       mov    (%eax),%edx
   58.68 :       8048e98:       add    $0x1,%edx
    8.86 :       8048e9b:       mov    %edx,(%eax)
    3.94 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    7.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    7.40 :       8048ea4:       cmp    $0xfffff,%eax

Most backend stalls are reported for add $0x1,%edx because it is the consumer of the data loaded from the array by the previous instruction. Together with the store to the array, these account for 70% of backend stalls, or, multiplied by the backend-stall share of the whole program (7%), for about 5% of all cycles. In other words, the cache is faster than your program. Now we can answer your first question:

Why does the time taken not increase at all, regardless of the size of my array?

Your test is so bad (unoptimized) that, although you are trying to measure the caches, they account for only about a 5% slowdown of the total run time. And your test is so unstable (noisy) that you would not see this 5% effect anyway.

With custom perf stat runs we can also measure the cache request-to-miss ratio. For the 4 KB program, the L1 data cache serves 99.99% of all loads and 99.999% of all stores. We can also note that your incorrect test generates several times more cache requests than are needed just to walk the array and increment every element (1 billion loads + 1 billion stores). The additional accesses are for working with local variables like x; they are always served by the cache because their addresses are constant (an illustrative register-friendly rewrite is sketched after the two listings below).

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a10
Size=1024 ITERATIONS=1048576, TIME=2412.25 ms

 Performance counter stats for './a10':

     5 375 195 765 L1-dcache-loads
           364 140 L1-dcache-load-misses     #    0,01% of all L1-dcache hits
     2 151 408 053 L1-dcache-stores
            13 350 L1-dcache-store-misses

For the 4 MB program the hit rate is many, many times worse: 100 times more misses! Now 1.2% of all memory requests are served not by L1 but by L2.

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a20
Size=1048576 ITERATIONS=1024, TIME=2443.92 ms

 Performance counter stats for './a20':

     5 378 035 007 L1-dcache-loads
        67 725 008 L1-dcache-load-misses     #    1,26% of all L1-dcache hits
     2 152 183 588 L1-dcache-stores
        67 266 426 L1-dcache-store-misses
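
To make the point about the extra accesses concrete: with an optimizing build the loop counter and pointer stay in registers, so the only memory traffic per element is the one load and one store of the element itself. An illustrative rewrite of the inner loop (not the code that was measured):

// Illustrative only: with optimizations enabled, 'p' and 'end' live in
// registers, so the cache sees exactly one load and one store per element
// instead of the extra stack traffic for 'x' visible in the debug build.
static void test_body_opt(volatile int* data, unsigned size, unsigned iterations)
{
    for (unsigned i = 0; i < iterations; ++i)
        for (volatile int* p = data, *end = data + size; p != end; ++p)
            ++*p;
}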

Is this really a case where we would notice the cache latency going from 4 CPU ticks up to 12 (3 times longer), when that change affects only 1.2% of cache requests, and when only about 7% of our program's cycles are sensitive to cache latency at all?

What if we use a bigger data set? OK, here is a25 (2^25 4-byte ints = 128 MB, several times larger than the cache):

$ perf stat   ./a25
Size=134217728 ITERATIONS=8, TIME=2437.25 ms

 Performance counter stats for './a25':

           262 417 page-faults               #    0,090 M/sec
    10 214 588 827 cycles                    #    3,499 GHz
     6 272 114 853 stalled-cycles-frontend   #   61,40% frontend cycles idle
     1 098 632 880 stalled-cycles-backend    #   10,76% backend  cycles idle
    13 683 671 982 instructions              #    1,34  insns per cycle
                                             #    0,46  stalled cycles per insn
     1 274 410 549 branches                  #  436,519 M/sec
           315 656 branch-misses             #    0,02% of all branches

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2444.13 ms

 Performance counter stats for './a25':

     6 138 410 226 L1-dcache-loads
        77 025 747 L1-dcache-load-misses     #    1,25% of all L1-dcache hits
     2 515 141 824 L1-dcache-stores
        76 320 695 L1-dcache-store-misses

Almost the same L1 miss rate, and more backend stalls. I was able to get stats for the "cache-references,cache-misses" events, and I suggest they are about the L3 cache (L2 would see several times more requests than are counted here):

$ perf stat -e 'cache-references,cache-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2440.71 ms

 Performance counter stats for './a25':

        17 053 482 cache-references
        11 829 118 cache-misses              #   69,365 % of all cache refs

So, the miss rate is high, but the test does 1 billion (useful) loads, and only 0.08 billion of them miss L1. Around 0.01 billion requests are served by memory. Memory latency is around 230 CPU clocks instead of the 4-clock L1 latency. Is the test able to see this? Maybe, if the noise is low.
