CPU cacheline and prefetch policy


Question

I read this article: http://igoro.com/archive/gallery-of-processor-cache-effects/. It says that because of cache line effects, the code:

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

will run in almost the same time, so I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with 64-bit Ubuntu, on an ARMv6-compatible processor rev 7 with Debian, and also on a Core 2 T6600. None of the results were what the article said.

My code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>

#define LENGTH (64 * 1024 * 1024)   // assumed to match the article's array size

// Difference between two timespecs, in nanoseconds.
long int jobTime(struct timespec start, struct timespec stop) {
    long int seconds = stop.tv_sec - start.tv_sec;
    long int nsec = stop.tv_nsec - start.tv_nsec;
    return seconds * 1000 * 1000 * 1000 + nsec;
}

int main() {
    struct timespec start;
    struct timespec stop;
    int i = 0;
    struct sched_param param;
    int *arr = malloc(LENGTH * sizeof(int));

    printf("---------sieofint %zu\n", sizeof(int));
    param.sched_priority = 0;
    sched_setscheduler(0, SCHED_FIFO, &param);

    //clock_gettime(CLOCK_MONOTONIC, &start);
    //for (i = 0; i < LENGTH; i++) arr[i] *= 5;
    //clock_gettime(CLOCK_MONOTONIC, &stop);
    //printf("step %d : time %ld\n", 1, jobTime(start, stop));

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 2) arr[i] *= 5;
    clock_gettime(CLOCK_MONOTONIC, &stop);

    printf("step %d : time %ld\n", 2, jobTime(start, stop));
    return 0;
}

Each time I choose one piece to compile and run (commenting out one loop and uncommenting the other). I compile with:

gcc -O0 -o cache cache.c -lrt

On the Xeon I get this:

step 1 : 258791478
step 2 : 97875746

I want to know whether what the article said is correct. Or do the newest CPUs have more advanced prefetch policies?

Answer

Short answer (TL;DR): you're accessing uninitialized data, so your first loop has to allocate new physical pages for the entire array inside the timed region.

When I run your code and comment out each of the sections in turn, I get almost the same timing for the two loops. However, I do get the results you report when I uncomment both sections and run them one after the other. This makes me suspect you did that too, and hit a cold-start effect when comparing the first loop with the second. It's easy to check - just swap the order of the loops and see if the first one is still slower.

To avoid this, either pick a LENGTH large enough (depending on your system) that the first loop gets no cache benefit that would help the second, or just add a single untimed traversal of the entire array.

Note that the second option wouldn't exactly prove what the blog wanted to say - that the memory latency masks the execution latency, so it doesn't matter how many elements of a cache line you use: you're still bottlenecked by the memory access time (or, more accurately, the bandwidth).

Also - benchmarking code compiled with -O0 is really bad practice.
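
For example, here is a minimal sketch of how the same loop could be benchmarked at -O2 without the compiler optimizing it away. This assumes GCC or Clang: the empty asm with a "memory" clobber is a common trick to keep the stores observable.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LENGTH (64 * 1024 * 1024)

/* Empty asm with a "memory" clobber: the compiler must assume memory was
   read or written here, so the stores in the timed loop can't be deleted. */
static inline void clobber_memory(void) {
    __asm__ volatile("" ::: "memory");
}

int main(void) {
    struct timespec start, stop;
    long i;
    int *arr = malloc(LENGTH * sizeof(int));

    for (i = 0; i < LENGTH; i++) arr[i] = 1;     /* warmup: fault the pages in */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 16) arr[i] *= 5;
    clobber_memory();
    clock_gettime(CLOCK_MONOTONIC, &stop);

    printf("time %ld\n", (stop.tv_sec - start.tv_sec) * 1000000000L
                         + (stop.tv_nsec - start.tv_nsec));
    free(arr);
    return 0;
}

Compiled with gcc -O2 -o cache cache.c -lrt, the timed loop is still executed as written.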

EDIT:

Here's what I'm getting (I removed the scheduling calls since they're not related).
This code:

for (i = 0; i < LENGTH; i++) arr[i] = 1;   // warmup!

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i++) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 1, jobTime(start, stop));

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i += 16) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 16, jobTime(start, stop));

gives:

---------sieofint 4
step 1 : time 58862552
step 16 : time 50215446

Commenting out the warmup line, on the other hand, gives the second loop the same advantage you reported:

---------sieofint 4
step 1 : time 279772411
step 16 : time 50615420

Swapping the order of the loops (warmup still commented out) shows that it's indeed not related to the step size but to the ordering:

---------sieofint 4
step 16 : time 250033980
step 1 : time 59168310

(gcc version 4.6.3, on an Opteron 6272)

Now a note about what's going on here. In theory, you'd expect the warmup to matter only when the array is small enough to sit in some cache; in this case the LENGTH you used is too big even for the L3 on most machines. However, you're forgetting the page map - you didn't just skip warming the data itself, you avoided initializing it in the first place. That could never give you meaningful results in real life, but since this is a benchmark you didn't notice it: you're just multiplying junk data and timing how long it takes.

This means that each new page you access in the first loop doesn't just go to memory - it probably takes a page fault and has to call into the OS to map a new physical page for it. That's a lengthy process, multiplied by the number of 4KB pages you use - it accumulates into a very long time. At this array size you can't even benefit from the TLBs (you have 16k distinct physical 4KB pages, far more than most TLBs can cover even with two levels), so it's purely a question of the fault flows. This could probably be measured with any profiling tool.
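
One way to check this directly (an illustration, assuming Linux perf is available) is to count the faults the process takes:

perf stat -e page-faults ./cache

With the warmup commented out, you'd expect on the order of one fault per 4KB page touched by the first loop.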

A second iteration over the same array won't have this effect and will be much faster - even though it still has to do a full page walk on each new page (done purely in hardware) and then fetch the data from memory.
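
If you wanted the first timed loop to start from already-mapped pages without an explicit warmup loop, one Linux-specific option (a sketch, not part of the original code) is to allocate with mmap and MAP_POPULATE instead of malloc:

#include <sys/mman.h>

/* MAP_POPULATE asks the kernel to fault all pages in up front, so the
   timed loops never take a soft page fault. */
int *arr = mmap(NULL, LENGTH * sizeof(int), PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

The pages are then both mapped and zero-initialized before any timing starts.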

By the way, this is also why, when you benchmark some behavior, you repeat the same thing multiple times (in this case it would have solved your problem if you had run over the array several times with the same stride and ignored the first few rounds).
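
A sketch of that idea, reusing jobTime() and the variables from the question (ROUNDS and SKIP are illustrative values):

#define ROUNDS 10
#define SKIP   2    /* the first rounds pay for page faults and cold caches */

for (int r = 0; r < ROUNDS; r++) {
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 16) arr[i] *= 5;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    if (r >= SKIP)
        printf("round %d : time %ld\n", r, jobTime(start, stop));
}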

