系统上的缓存大小估计? [英] Cache size estimation on your system?
问题描述
我从这个链接 (https://gist.github.com/jiewmeng 得到了这个程序/3787223.我一直在网上搜索以获得更好地了解处理器缓存(L1 和 L2)的想法.我希望能够编写一个程序,使我能够猜测 L1 的大小和我的新笔记本电脑上的 L2 缓存.(仅用于学习目的.我知道我可以查看规格.)
I got this program from this link (https://gist.github.com/jiewmeng/3787223).I have been searching the web with the idea of gaining a better understanding of processor caches (L1 and L2).I want to be able to write a program that would enable me to guess the size of L1 and L2 cache on my new Laptop.(just for learning purpose.I know I could check the spec.)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define KB 1024
#define MB 1024 * 1024
int main() {
unsigned int steps = 256 * 1024 * 1024;
static int arr[4 * 1024 * 1024];
int lengthMod;
unsigned int i;
double timeTaken;
clock_t start;
int sizes[] = {
1 * KB, 4 * KB, 8 * KB, 16 * KB, 32 * KB, 64 * KB, 128 * KB, 256 * KB,
512 * KB, 1 * MB, 1.5 * MB, 2 * MB, 2.5 * MB, 3 * MB, 3.5 * MB, 4 * MB
};
int results[sizeof(sizes)/sizeof(int)];
int s;
/*for each size to test for ... */
for (s = 0; s < sizeof(sizes)/sizeof(int); s++)
{
lengthMod = sizes[s] - 1;
start = clock();
for (i = 0; i < steps; i++)
{
arr[(i * 16) & lengthMod] *= 10;
arr[(i * 16) & lengthMod] /= 10;
}
timeTaken = (double)(clock() - start)/CLOCKS_PER_SEC;
printf("%d, %.8f
", sizes[s] / 1024, timeTaken);
}
return 0;
}
我机器上的程序输出如下.我该如何解释这些数字?这个程序告诉我什么.?
The output of the program in my machine is as follows.How do I interpret the numbers? What does this program tell me.?
1, 1.07000000
4, 1.04000000
8, 1.06000000
16, 1.13000000
32, 1.14000000
64, 1.17000000
128, 1.20000000
256, 1.21000000
512, 1.19000000
1024, 1.23000000
1536, 1.23000000
2048, 1.46000000
2560, 1.21000000
3072, 1.45000000
3584, 1.47000000
4096, 1.94000000
推荐答案
您需要直接访问内存
我的意思不是 DMA 传输.CPU当然必须访问内存(否则你不会测量CACHEs),但尽可能直接......所以测量可能不会很准确Windows/Linux,因为服务和其他进程可能会在运行时干扰缓存.多次测量并取平均值以获得更好的结果(或使用最快的时间或一起过滤).为获得最佳准确性,请使用 DOS 和 asm 例如
I am not meaning DMA transfer by this. Memory must be accessed by CPU of course (otherwise you are not measuring CACHEs) but as directly as it can be ... so measurements will probably not be very accurate on Windows/Linux because services and other processes can mess with caches during runtime. Measure many times and average for better results (or use the fastest time or filter it together). For best accuracy use DOS and asm for example
rep + movsb,movsw,movsd
rep + stosb,stosw,stosd
所以你测量的是内存传输,而不是你的代码中的其他东西!!!
测量原始传输时间并绘制图表
x
轴是传输块大小y
轴是传输速度
x
axis is transfer block sizey
axis is transfer speed
传输速率相同的区域与适当的CACHE层一致
zones with the same transfer rate are consistent with appropriate CACHE layer
[Edit1] 找不到我的旧源代码,所以我现在在 C++ 中为 windows 破坏了一些东西:
could not find my old source code for this so I busted something right now in C++ for windows:
时间测量:
//---------------------------------------------------------------------------
double performance_Tms=-1.0, // perioda citaca [ms]
performance_tms= 0.0; // zmerany cas [ms]
//---------------------------------------------------------------------------
void tbeg()
{
LARGE_INTEGER i;
if (performance_Tms<=0.0) { QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart); }
QueryPerformanceCounter(&i); performance_tms=double(i.QuadPart);
}
//---------------------------------------------------------------------------
double tend()
{
LARGE_INTEGER i;
QueryPerformanceCounter(&i); performance_tms=double(i.QuadPart)-performance_tms; performance_tms*=performance_Tms;
return performance_tms;
}
//---------------------------------------------------------------------------
基准测试(32 位应用):
//---------------------------------------------------------------------------
DWORD sizes[]= // used transfer block sizes
{
1<<10, 2<<10, 3<<10, 4<<10, 5<<10, 6<<10, 7<<10, 8<<10, 9<<10,
10<<10, 11<<10, 12<<10, 13<<10, 14<<10, 15<<10, 16<<10, 17<<10, 18<<10,
19<<10, 20<<10, 21<<10, 22<<10, 23<<10, 24<<10, 25<<10, 26<<10, 27<<10,
28<<10, 29<<10, 30<<10, 31<<10, 32<<10, 48<<10, 64<<10, 80<<10, 96<<10,
112<<10,128<<10,192<<10,256<<10,320<<10,384<<10,448<<10,512<<10, 1<<20,
2<<20, 3<<20, 4<<20, 5<<20, 6<<20, 7<<20, 8<<20, 9<<20, 10<<20,
11<<20, 12<<20, 13<<20, 14<<20, 15<<20, 16<<20, 17<<20, 18<<20, 19<<20,
20<<20, 21<<20, 22<<20, 23<<20, 24<<20, 25<<20, 26<<20, 27<<20, 28<<20,
29<<20, 30<<20, 31<<20, 32<<20,
};
const int N=sizeof(sizes)>>2; // number of used sizes
double pmovsd[N]; // measured transfer rate rep MOVSD [MB/sec]
double pstosd[N]; // measured transfer rate rep STOSD [MB/sec]
//---------------------------------------------------------------------------
void measure()
{
int i;
BYTE *dat; // pointer to used memory
DWORD adr,siz,num; // local variables for asm
double t,t0;
HANDLE hnd; // process handle
// enable priority change (huge difference)
#define measure_priority
// enable critical sections (no difference)
// #define measure_lock
for (i=0;i<N;i++) pmovsd[i]=0.0;
for (i=0;i<N;i++) pstosd[i]=0.0;
dat=new BYTE[sizes[N-1]+4]; // last DWORD +4 Bytes (should be 3 but i like 4 more)
if (dat==NULL) return;
#ifdef measure_priority
hnd=GetCurrentProcess(); if (hnd!=NULL) { SetPriorityClass(hnd,REALTIME_PRIORITY_CLASS); CloseHandle(hnd); }
Sleep(200); // wait to change take effect
#endif
#ifdef measure_lock
CRITICAL_SECTION lock; // lock handle
InitializeCriticalSectionAndSpinCount(&lock,0x00000400);
EnterCriticalSection(&lock);
#endif
adr=(DWORD)(dat);
for (i=0;i<N;i++)
{
siz=sizes[i]; // siz = actual block size
num=(8<<20)/siz; // compute n (times to repeat the measurement)
if (num<4) num=4;
siz>>=2; // size / 4 because of 32bit transfer
// measure overhead
tbeg(); // start time meassurement
asm {
push esi
push edi
push ecx
push ebx
push eax
mov ebx,num
mov al,0
loop0: mov esi,adr
mov edi,adr
mov ecx,siz
// rep movsd // es,ds already set by C++
// rep stosd // es already set by C++
dec ebx
jnz loop0
pop eax
pop ebx
pop ecx
pop edi
pop esi
}
t0=tend(); // stop time meassurement
// measurement 1
tbeg(); // start time meassurement
asm {
push esi
push edi
push ecx
push ebx
push eax
mov ebx,num
mov al,0
loop1: mov esi,adr
mov edi,adr
mov ecx,siz
rep movsd // es,ds already set by C++
// rep stosd // es already set by C++
dec ebx
jnz loop1
pop eax
pop ebx
pop ecx
pop edi
pop esi
}
t=tend(); // stop time meassurement
t-=t0; if (t<1e-6) t=1e-6; // remove overhead and avoid division by zero
t=double(siz<<2)*double(num)/t; // Byte/ms
pmovsd[i]=t/(1.024*1024.0); // MByte/s
// measurement 2
tbeg(); // start time meassurement
asm {
push esi
push edi
push ecx
push ebx
push eax
mov ebx,num
mov al,0
loop2: mov esi,adr
mov edi,adr
mov ecx,siz
// rep movsd // es,ds already set by C++
rep stosd // es already set by C++
dec ebx
jnz loop2
pop eax
pop ebx
pop ecx
pop edi
pop esi
}
t=tend(); // stop time meassurement
t-=t0; if (t<1e-6) t=1e-6; // remove overhead and avoid division by zero
t=double(siz<<2)*double(num)/t; // Byte/ms
pstosd[i]=t/(1.024*1024.0); // MByte/s
}
#ifdef measure_lock
LeaveCriticalSection(&lock);
DeleteCriticalSection(&lock);
#endif
#ifdef measure_priority
hnd=GetCurrentProcess(); if (hnd!=NULL) { SetPriorityClass(hnd,NORMAL_PRIORITY_CLASS); CloseHandle(hnd); }
#endif
delete dat;
}
//---------------------------------------------------------------------------
其中数组 pmovsd[]
和 pstosd[]
保存测量的 32bit
传输速率 [MByte/sec]
.您可以通过在测量功能开始时使用/rem 两个定义来配置代码.
Where arrays pmovsd[]
and pstosd[]
holds the measured 32bit
transfer rates [MByte/sec]
. You can configure the code by use/rem two defines at the start of measure function.
图形输出:
为了最大限度地提高准确性,您可以将过程优先级更改为最大值.因此,创建具有最大优先级的测量线程(我尝试过,但实际上它把事情搞砸了)并向其中添加了临界区,这样操作系统就不会经常中断测试(带螺纹和不带螺纹没有明显区别).如果您想使用 Byte
传输,请考虑它仅使用 16bit
寄存器,因此您需要添加循环和地址迭代.
To maximize accuracy you can change process priority class to maximum. So create measure thread with max priority (I try it but it mess thing up actually) and add critical section to it so the test will not be uninterrupted by OS as often (no visible difference with and without threads). If you want to use Byte
transfers then take account that it uses only 16bit
registers so you need to add loop and address iterations.
附注.
如果您在笔记本电脑上尝试此操作,那么您应该使 CPU 过热,以确保您测量的最高 CPU/内存 速度.所以没有 Sleep
s.测量前的一些愚蠢的循环会这样做,但它们应该至少运行几秒钟.您也可以通过 CPU 频率测量和上升时循环来同步它.饱和后停止......
If you try this on notebook then you should overheat the CPU to be sure that you measure on top CPU/Mem speed. So no Sleep
s. Some stupid loops before measurement will do it but they should run at least few seconds. Also you can synchronize this by CPU frequency measurement and loop while is rising. Stop after it saturates ...
asm 指令 RDTSC
最适合于此(但请注意,随着新架构的出现,其含义略有变化).
asm instruction RDTSC
is best for this (but beware its meaning has slightly changed with new architectures).
如果您不在 Windows 下,请将功能 tbeg,tend
更改为您的 OS 等效项
If you are not under Windows then change functions tbeg,tend
to your OS equivalents
[edit2] 进一步提高准确性
在最终解决了 VCL 影响测量精度的问题之后,我发现这个问题和更多关于它的信息 此处,为了提高准确性,您可以在基准测试之前执行以下操作:
Well after finally solving problem with VCL affecting measurement accuracy which I discover thanks to this question and more about it here, to improve accuracy you can prior to benchmark do this:
设置进程优先级为
realtime
将进程关联设置为单个 CPU
所以您只测量多核上的单个 CPU
so you measure just single CPU on multi-core
刷新数据和指令缓存
例如:
// before mem benchmark
DWORD process_affinity_mask=0;
DWORD system_affinity_mask =0;
HANDLE hnd=GetCurrentProcess();
if (hnd!=NULL)
{
// priority
SetPriorityClass(hnd,REALTIME_PRIORITY_CLASS);
// affinity
GetProcessAffinityMask(hnd,&process_affinity_mask,&system_affinity_mask);
process_affinity_mask=1;
SetProcessAffinityMask(hnd,process_affinity_mask);
GetProcessAffinityMask(hnd,&process_affinity_mask,&system_affinity_mask);
}
// flush CACHEs
for (DWORD i=0;i<sizes[N-1];i+=7)
{
dat[i]+=i;
dat[i]*=i;
dat[i]&=i;
}
// after mem benchmark
if (hnd!=NULL)
{
SetPriorityClass(hnd,NORMAL_PRIORITY_CLASS);
SetProcessAffinityMask(hnd,system_affinity_mask);
}
所以更准确的测量看起来像这样:
So the more accurate measurement looks like this:
这篇关于系统上的缓存大小估计?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!