How is an LRU cache implemented in a CPU?


Question

I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?

Also, what happens in a dual-core system where both CPUs write to the same address simultaneously?

Solution

For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the index of the way that did not hit.
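As a rough sketch (not taken from the original answer), that single-bit scheme might be modelled in C as follows; the set count, the array layout, and the convention that the bit stores the index of the victim way are assumptions made for the example:

```c
/* Minimal sketch of one LRU bit per set in a 2-way cache.
 * lru_bit[set] holds the index (0 or 1) of the way that was NOT
 * used most recently, i.e. the replacement victim. All names and
 * sizes are illustrative. */
#include <stdint.h>

#define NUM_SETS 64

static uint8_t lru_bit[NUM_SETS];   /* 1 bit of state per set */

/* Called on a hit in `way` (0 or 1) of `set`. */
static void update_on_hit(unsigned set, unsigned way)
{
    lru_bit[set] = way ^ 1u;        /* the other way is now LRU */
}

/* Called on a miss: returns the way to evict, then marks it MRU. */
static unsigned choose_victim(unsigned set)
{
    unsigned victim = lru_bit[set];
    update_on_hit(set, victim);     /* the refilled line becomes MRU */
    return victim;
}
```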

For larger associativity, the number of states increases dramatically: the factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set, and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.
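Those bit counts are simply ceil(log2(W!)); a small illustrative check:

```c
/* Compute how many state bits per set a true-LRU order needs for
 * W ways, i.e. ceil(log2(W!)). Illustration only. */
#include <math.h>
#include <stdio.h>

static unsigned lru_state_bits(unsigned ways)
{
    double log2_factorial = 0.0;
    for (unsigned k = 2; k <= ways; k++)
        log2_factorial += log2((double)k);
    return (unsigned)ceil(log2_factorial);
}

int main(void)
{
    printf("4-way: %u bits per set\n", lru_state_bits(4));  /* 5  */
    printf("8-way: %u bits per set\n", lru_state_bits(8));  /* 16 */
    return 0;
}
```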

For a 4-way cache, the following encoding of the state seems to work reasonably well: two bits for the most recently used way number, two bits for the next most recently used way number, and a bit indicating whether the higher- or lower-numbered of the two remaining ways was more recently used (a behavioural sketch of this encoding follows the list below).

  • On an MRU hit, the state is unchanged.
  • On a next-MRU hit, the two two-bit fields are swapped.
  • On a hit to one of the other two ways, the numbers of those two ways are decoded, the number of the way that hit is placed in the first two-bit portion, and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the former next-MRU way number is higher or lower than the remaining way that did not hit.
  • On a miss, the state is updated as if a hit on the LRU way had occurred.
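A behavioural C model of this encoding, under the assumed convention that the extra bit is 1 when the higher-numbered of the two remaining ways is the more recently used; the field layout and helper names are invented for illustration:

```c
/* Behavioural model of the 5-bit, 4-way LRU encoding described above. */
#include <stdint.h>

typedef struct {
    uint8_t mru;     /* most recently used way, 0..3                     */
    uint8_t nmru;    /* next most recently used way, 0..3                */
    uint8_t hi_mru;  /* 1: the higher-numbered remaining way is the more
                        recently used of the last two                    */
} lru4_t;

/* Return the two ways not named by mru/nmru, lowest-numbered first. */
static void remaining_ways(const lru4_t *s, uint8_t *lo, uint8_t *hi)
{
    uint8_t found = 0;
    for (uint8_t w = 0; w < 4; w++) {
        if (w == s->mru || w == s->nmru)
            continue;
        if (found++ == 0) *lo = w; else *hi = w;
    }
}

/* Decode the least recently used way from the state. */
static uint8_t lru_way(const lru4_t *s)
{
    uint8_t lo, hi;
    remaining_ways(s, &lo, &hi);
    return s->hi_mru ? lo : hi;
}

static void update_on_hit(lru4_t *s, uint8_t way)
{
    if (way == s->mru) {
        return;                              /* MRU hit: state unchanged */
    } else if (way == s->nmru) {
        uint8_t t = s->mru;                  /* next-MRU hit: swap fields */
        s->mru = s->nmru;
        s->nmru = t;
    } else {
        uint8_t lo, hi;
        remaining_ways(s, &lo, &hi);
        uint8_t other = (way == lo) ? hi : lo;   /* remaining non-hit way */
        uint8_t old_nmru = s->nmru;
        s->nmru = s->mru;
        s->mru = way;
        /* The old next-MRU is now the more recent of the last two ways. */
        s->hi_mru = (old_nmru > other) ? 1 : 0;
    }
}

/* On a miss: evict the LRU way and update as if it had just hit. */
static uint8_t update_on_miss(lru4_t *s)
{
    uint8_t victim = lru_way(s);
    update_on_hit(s, victim);
    return victim;
}
```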

Because LRU tracking has such overhead, simpler mechanisms like binary-tree pseudo-LRU are often used. On a hit, such a scheme just updates each branching node of the tree with which half of the associativity the hit was in. For a power-of-two number of ways W, a binary-tree pLRU cache would have W-1 bits of state per set. A hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) is less recently used, clear the higher bit at the next level to indicate that the lower half of those ways (4,5) is less recently used, and set the higher bit at the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
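A sketch of binary-tree pseudo-LRU for one 8-way set, using the bit convention from the example above (a cleared node bit means the lower half beneath that node is less recently used); the array layout and traversal are assumptions for illustration:

```c
/* Binary-tree pseudo-LRU for one 8-way set: 7 one-bit nodes. */
#include <stdint.h>

#define WAYS 8

/* tree[0] is the root; children of node i are 2*i+1 and 2*i+2.
 * Leaves of the 3-level tree correspond to the 8 ways.         */
static uint8_t tree[WAYS - 1];

/* On a hit (or fill) of `way`, point every node on the path away
 * from the half that contains `way`. */
static void plru_touch(unsigned way)
{
    unsigned node = 0, lo = 0, hi = WAYS;
    while (hi - lo > 1) {
        unsigned mid = (lo + hi) / 2;
        if (way < mid) {            /* hit is in the lower half...       */
            tree[node] = 1;         /* ...so the upper half is now LRU   */
            node = 2 * node + 1;
            hi = mid;
        } else {                    /* hit is in the upper half...       */
            tree[node] = 0;         /* ...so the lower half is now LRU   */
            node = 2 * node + 2;
            lo = mid;
        }
    }
}

/* On a miss, walk toward the half marked less recently used. */
static unsigned plru_victim(void)
{
    unsigned node = 0, lo = 0, hi = WAYS;
    while (hi - lo > 1) {
        unsigned mid = (lo + hi) / 2;
        if (tree[node] == 0) {      /* lower half is less recently used */
            node = 2 * node + 1;
            hi = mid;
        } else {                    /* upper half is less recently used */
            node = 2 * node + 2;
            lo = mid;
        }
    }
    return lo;                      /* the selected victim way */
}
```

With this convention, plru_touch(6) clears the root bit, clears the next-level bit covering ways 4-7, and sets the final-level bit covering ways 6-7, matching the example above.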

For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.
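As a loose illustration of the general idea only (not the scheme from the cited paper), one could give each line a short timestamp copied from a per-set miss counter and evict the line with the oldest timestamp; the field widths and wrap-around handling below are assumptions:

```c
/* Illustrative abbreviated-timestamp replacement driven by a per-set
 * miss counter. Not the exact mechanism from Brehob et al. */
#include <stdint.h>

#define WAYS 4
#define TS_BITS 4                       /* abbreviated timestamp width */
#define TS_MASK ((1u << TS_BITS) - 1u)

typedef struct {
    uint8_t ts[WAYS];                   /* per-line abbreviated timestamp */
    uint8_t miss_count;                 /* per-set miss counter           */
} set_ts_t;

/* Record a use of `way` by stamping it with the current counter. */
static void touch(set_ts_t *s, unsigned way)
{
    s->ts[way] = s->miss_count & TS_MASK;
}

/* On a miss, pick the line whose stamp is furthest behind the counter
 * (modulo the timestamp width, to tolerate wrap-around). */
static unsigned victim(set_ts_t *s)
{
    unsigned worst = 0, worst_age = 0;
    for (unsigned w = 0; w < WAYS; w++) {
        unsigned age = (unsigned)((s->miss_count - s->ts[w]) & TS_MASK);
        if (age >= worst_age) { worst_age = age; worst = w; }
    }
    s->miss_count = (s->miss_count + 1) & TS_MASK;  /* count this miss */
    return worst;
}
```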

With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race, and one core will get exclusive access. If one of the writing cores already has the cache line in the shared state, it will probably be more likely to win the race. With the cache line in the shared state, that cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present, a write would typically need to request the cache line of data as well as request exclusive state.
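A minimal sketch of that write path, using MESI-style state names (the answer itself only speaks of shared and exclusive states; the message names printed below are illustrative):

```c
/* Simplified view of what a core's L1 must do before it can perform
 * a store to a line, depending on the line's current state. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state_t;

static line_state_t on_store(line_state_t st)
{
    switch (st) {
    case MODIFIED:
    case EXCLUSIVE:
        /* Already owned: write locally, no coherence traffic needed. */
        return MODIFIED;
    case SHARED:
        /* Data present but not owned: ask other caches to discard
         * their copies before writing.                              */
        printf("send invalidate request\n");
        return MODIFIED;
    case INVALID:
    default:
        /* Line absent: request the data and ownership together.     */
        printf("send read-for-ownership request\n");
        return MODIFIED;
    }
}
```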

Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.
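A small pthreads program can make the ping-pong effect visible through false sharing; the 64-byte line size and the speed-up from padding are assumptions about typical hardware, not guarantees:

```c
/* Two threads increment different counters that share a cache line,
 * so the line bounces between the cores' caches. Padding each counter
 * onto its own line usually removes the slowdown.
 * Build with: cc -O2 -pthread false_share.c */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct counters {
    volatile unsigned long a;   /* written by thread 0 */
    /* Uncommenting the padding puts `b` on a different (assumed
     * 64-byte) cache line and typically makes both threads faster. */
    /* char pad[64]; */
    volatile unsigned long b;   /* written by thread 1 */
};

static struct counters c;

static void *bump_a(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++) c.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++) c.b++;
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump_a, NULL);
    pthread_create(&t1, NULL, bump_b, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("a=%lu b=%lu\n", c.a, c.b);
    return 0;
}
```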
