编写程序以获取CPU缓存的大小和级别 [英] Write a program to get CPU cache sizes and levels

查看:148
本文介绍了编写程序以获取CPU缓存的大小和级别的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想编写一个程序来获取我的缓存大小(L1,L2,L3)。我知道它的一般概念。

I want to write a program to get my cache size(L1, L2, L3). I know the general idea of it.


  1. 分配一个大数组

  2. 访问其中的一部分

所以我写了一个小程序。
这是我的代码:

So I wrote a little program. Here's my code:

#include <cstdio>
#include <time.h>
#include <sys/mman.h>

const int KB = 1024;
const int MB = 1024 * KB;
const int data_size = 32 * MB;
const int repeats = 64 * MB;
const int steps = 8 * MB;
const int times = 8;

long long clock_time() {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
}

int main() {
    // allocate memory and lock
    void* map = mmap(NULL, (size_t)data_size, PROT_READ | PROT_WRITE, 
                     MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
    if (map == MAP_FAILED) {
        return 0;
    }
    int* data = (int*)map;

    // write all to avoid paging on demand
    for (int i = 0;i< data_size / sizeof(int);i++) {
        data[i]++;
    }

    int steps[] = { 1*KB, 4*KB, 8*KB, 16*KB, 24 * KB, 32*KB, 64*KB, 128*KB, 
                    128*KB*2, 128*KB*3, 512*KB, 1 * MB, 2 * MB, 3 * MB, 4 * MB, 
                    5 * MB, 6 * MB, 7 * MB, 8 * MB, 9 * MB};
    for (int i = 0; i <= sizeof(steps) / sizeof(int) - 1; i++) {
        double totalTime = 0;    
        for (int k = 0; k < times; k++) {
            int size_mask = steps[i] / sizeof(int) - 1;
            long long start = clock_time();
            for (int j = 0; j < repeats; j++) {
                ++data[ (j * 16) & size_mask ];
            }
            long long end = clock_time();
            totalTime += (end - start) / 1000000000.0;
        }
        printf("%d time: %lf\n", steps[i] / KB, totalTime);
    }
    munmap(map, (size_t)data_size);
    return 0;
}

但是,结果太奇怪了:

1 time: 1.989998
4 time: 1.992945
8 time: 1.997071
16 time: 1.993442
24 time: 1.994212
32 time: 2.002103
64 time: 1.959601
128 time: 1.957994
256 time: 1.975517
384 time: 1.975143
512 time: 2.209696
1024 time: 2.437783
2048 time: 7.006168
3072 time: 5.306975
4096 time: 5.943510
5120 time: 2.396078
6144 time: 4.404022
7168 time: 4.900366
8192 time: 8.998624
9216 time: 6.574195

我的CPU是Intel(R)Core(TM)i3-2350M。 L1高速缓存:32K(用于数据),L2高速缓存256K,L3高速缓存3072K。
似乎没有遵循任何规则。我无法从中获取有关缓存大小或缓存级别的信息。
有人可以帮忙吗?

My CPU is Intel(R) Core(TM) i3-2350M. L1 Cache: 32K (for data), L2 Cache 256K, L3 Cache 3072K. Seems like it doesn't follow any rule. I can't get information of cache size or cache level from that. Could anybody give some help? Thanks in advance.

更新:
按照@Leeor的建议,我使用 j * 64 而不是 j * 16 。新结果:

1 time: 1.996282
4 time: 2.002579
8 time: 2.002240
16 time: 1.993198
24 time: 1.995733
32 time: 2.000463
64 time: 1.968637
128 time: 1.956138
256 time: 1.978266
384 time: 1.991912
512 time: 2.192371
1024 time: 2.262387
2048 time: 3.019435
3072 time: 2.359423
4096 time: 5.874426
5120 time: 2.324901
6144 time: 4.135550
7168 time: 3.851972
8192 time: 7.417762
9216 time: 2.272929
10240 time: 3.441985
11264 time: 3.094753

两个峰值,分别为4096K和8192K。还是很奇怪。

Two peaks, 4096K and 8192K. Still weird.

推荐答案

我不确定这是否是唯一的问题,但这绝对是最大的问题-您的代码会很快触发硬件流预取器,使您几乎总是遇到L1或L2延迟。

I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, making you almost always hit in L1 or L2 latencies.

更多详细信息,请参见此处- http://software.intel.com / en-us / articles /使用硬件实现的预取器优化应用程序性能在英特尔Coret微体系结构上

More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

您应该禁用它们(通过BIOS或任何其他方式),或者至少通过替换 j * 16 (*每个int 4字节= 64B,一个缓存)来延长步长line-流检测器的经典步幅),其中 j * 64 (4条缓存行)。原因是-预取器可以在每个流请求中发出2次预取,因此当您进行单元跨步时,它会在代码之前运行,当代码跳过2行时仍会领先于您,但是随着时间的延长,它就变得毫无用处跳跃(由于模数的原因,3不好,您需要一个step_size的分隔线)

For your benchmark You should either disable them (through BIOS or any other means), or at least make your steps longer by replacing j*16 (* 4 bytes per int = 64B, one cache line - a classic unit stride for the stream detector), with j*64 (4 cache lines). The reason being - the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code is jumping over 2 lines, but become mostly useless with longer jumps (3 isn't good because of your modulu, you need a divider of step_size)

用新结果更新问题,我们可以找出是否还有其他问题

Update the questions with the new results and we can figure out if there's anything else here.

EDIT1
好​​,我运行了固定代码并得到了-

EDIT1: Ok, I ran the fixed code and got -

1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817

如果您忽略了几列,则更有意义-您在32k(L1大小)之后跳转,但不是在256k(L2大小)之后跳转,我们对于384的结果太好了,仅在512k处跳转。最后一次跳转为8M(我的LLC大小),但9k又被打破了。

It makes much more sense if you ignore a few columns - you jump after the 32k (L1 size), but instead of jumping after 256k (L2 size), we get too good of a result for 384, and jump only at 512k. Last jump is at 8M (my LLC size), but 9k is broken again.

这使我们能够发现下一个错误-仅在大小为2的幂时与大小掩码AND才有意义,否则就不必回卷了,而是重复一些

This allows us to spot the next error - ANDing with size mask only makes sense when it's a power of 2, otherwise you don't wrap around, but instead repeat some of the last addresses again (which ends up in optimistic results since it's fresh in the cache).

尝试替换 ...& size_mask 加上%步长[i] / sizeof(int),该模数会更昂贵,但是如果您想要这些尺寸,则需要它(或者,当运行索引超出当前大小时,该索引会归零。

Try replacing the ... & size_mask with % steps[i]/sizeof(int), the modulu is more expensive but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size)

这篇关于编写程序以获取CPU缓存的大小和级别的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆