Memory benchmark plot: understanding cache behaviour


Question

I've tried every kind of reasoning I can possibly come up with, but I don't really understand this plot. It basically shows the performance of reading and writing arrays of different sizes with different strides. I understand that for a small stride like 4 bytes I read all the cells in the cache, so I get good performance. But what happens when I have a 2 MB array and a 4k stride? Or the 4M array and a 4k stride? Why is the performance so bad? Finally, why is performance decent when I have a 1MB array and the stride is 1/8 of its size, worst when it is 1/4 of the size, and then super good again at half the size? Please help me, this thing is driving me mad.

At this link, the code: https://dl.dropboxusercontent.com/u/18373264/membench/membench.c

Answer

Your code loops for a given time interval instead of a constant number of accesses, so you're not comparing the same amount of work, and not all cache sizes/strides enjoy the same number of repetitions (so they get different chances for caching).

Also note that the second loop (the inner for) will probably get optimized away, since you don't use temp anywhere.

Another effect at play here is TLB utilization:

On a 4k-page system, as you grow your stride while it is still below 4k, you'll get less and less utilization out of each page (finally reaching one access per page at the 4k stride). That means growing access times, as you'll have to access the 2nd-level TLB on each access (possibly even serializing your accesses, at least partially).

Since you normalize your iteration count by the stride size, you'll have in general (size / stride) accesses in your innermost loop, but * stride outside. However, the number of unique pages you access differs: for a 2M array with a 2k stride, you'll have 1024 accesses in the inner loop but only 512 unique pages, so 512 * 2k accesses to the L2 TLB. With the 4k stride there would still be 512 unique pages, but 512 * 4k L2 TLB accesses.

For the 1M array case, you'll have 256 unique pages overall, so the 2k stride would have 256 * 2k L2 TLB accesses, and the 4k stride would again have twice as many.

This explains both why there's a gradual performance drop on each line as you approach 4k, and why each doubling in array size doubles the time for the same stride. The smaller array sizes may still partially enjoy the L1 TLB, so you don't see the same effect (although I'm not sure why 512k is there).

Now, once you start growing the stride above 4k, you suddenly start benefiting again, since you're actually skipping whole pages. An 8k stride would access only every other page, taking half the overall TLB accesses of the 4k stride for the same array size, and so on.

