When L1 misses are a lot different than L2 accesses... TLB related?

Question

I have been running some benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing for me.

Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One of the explanations I find would be TLB related: when a virtual address is not mapped in TLB, the system automatically skips searches in some cache levels. Does this seem legitimate?

Answer

First, inclusive cache hierarchies may not be as common as you assume. For example, I do not think any current Intel processors - not Nehalem, not Sandybridge, possibly not the Atoms - have an L1 that is included within the L2. (Nehalem, and probably Sandybridge, do, however, have both L1 and L2 included within the L3; using Intel's current terminology, FLC and MLC in LLC.)

But this doesn't necessarily matter. In most cache hierarchies, if you have an L1 cache miss, then that miss will probably be looked up in the L2. It doesn't matter whether the hierarchy is inclusive or not. To do otherwise, you would have to have something that told you that the data you care about is (probably) not in the L2, so you don't need to look. I have designed protocols and memory types that do this - e.g. a memory type that cached only in the L1 but not the L2, useful for stuff like graphics where you get the benefits of combining in the L1, but where you are repeatedly scanning over a large array, so caching in the L2 is not a good idea. But I am not aware of anyone shipping them at the moment.

Anyway, here are some reasons why the number of L1 cache misses may not be equal to the number of L2 cache accesses.

You don't say what systems you are working on - I know my answer is applicable to Intel x86s such as Nehalem and Sandybridge, whose EMON performance event monitoring allows you to count things such as L1 and L2 cache misses, etc. It will probably also apply to any modern microprocessor with hardware performance counters for cache misses, such as those on ARM and Power.

Most modern microprocessors do not stop at the first cache miss, but keep going, trying to do extra work. This is overall often called speculative execution. Furthermore, the processor may be in-order or out-of-order; although the latter may give you even greater differences between the number of L1 misses and the number of L2 accesses, it's not necessary - you can get this behavior even on in-order processors.

Short answer: many of these speculative memory accesses will be to the same memory location. They will be squashed and combined.

The performance event "L1 cache misses" is probably[*] counting the number of (speculative) instructions that missed the L1 cache. These then allocate a hardware data structure, called a fill buffer at Intel and a miss status handling register (MSHR) in some other places. Subsequent cache misses that are to the same cache line will miss the L1 cache but hit the fill buffer, and will get squashed. Only one of them, typically the first, will get sent to the L2 and counted as an L2 access.

By the way, there may be a performance event for this: Squashed_Cache_Misses.

There may also be a performance event L1_Cache_Misses_Retired. But this may undercount, since speculation may pull the data into the cache, and a cache miss at retirement may never occur.

([*] By the way, when I say "probably" here I mean "On the machines that I helped design". Almost definitely. I might have to check the definition, look at the RTL, but I would be immensely surprised if not. It is almost guaranteed.)

E.g. imagine that you are accessing bytes A[0], A[1], A[2], ... A[63], A[64], ...

If the address of A[0] is equal to zero modulo 64, then A[0]..A[63] will be in the same cache line, on a machine with 64-byte cache lines. If the code that uses these is simple, it is quite possible that all of them can be issued speculatively. QED: 64 speculative memory accesses, 64 L1 cache misses, but only one L2 memory access.

(By the way, don't expect the numbers to be quite so clean. You might not get exactly 64 L1 accesses per L2 access.)

More possibilities:

If the number of L2 accesses is greater than the number of L1 cache misses (I have almost never seen it, but it is possible), you may have a memory access pattern that is confusing a hardware prefetcher. The hardware prefetcher tries to predict which cache lines you are going to need. If the prefetcher predicts badly, it may fetch cache lines that you don't actually need. Oftentimes there is a performance event to count Prefetches_from_L2 or Prefetches_from_Memory.

Some machines may cancel speculative accesses that have caused an L1 cache miss, before they are sent to the L2. However, I don't know of Intel doing this.
