Way prediction in modern cache

Problem Description

We know that direct-mapped caches are better than set-associative caches in terms of cache hit time, as there is no search involved for a particular tag. On the other hand, set-associative caches usually show better hit rates than direct-mapped caches.

I read that modern processors try to combine the benefits of both by using a technique called way-prediction: they predict the line of the given set where the hit is most likely to happen and search only in that line. If the attempt results in a miss, they fall back to the normal set-associative search across all the cache lines of the set.

I want to understand how this way-prediction works. How is the latency of the prediction hardware/logic smaller than the search latency of the complete set?

Solution

The way prediction mechanism for AMD's Bulldozer and Ryzen families is µtag-based and is documented in "Take A Way: Exploring the Security Implications of AMD's Cache Way Predictors" (Moritz Lipp et al., 2020, PDF).

µtag-based way prediction matches a hash of the virtual address rather than the full virtual address, so not only does it avoid the address translation overhead, like a virtually tagged cache, but, by using less storage, the prediction array can be accessed with lower latency and the tag checked with slightly lower latency. "Take A Way" reverse engineered that both AMD's Bulldozer family and Ryzen family use bits 12 through 27 for the hash function and that a single xor (⊕) layer is used, which reduces latency. The Bulldozer family used 12⊕21, 13⊕22, 14⊕23, 15⊕24, 16⊕25, 17⊕26, 18⊕27; the Ryzen family used 12⊕27, 13⊕26, 14⊕25, 15⊕20, 16⊕21, 17⊕22, 18⊕23, 19⊕24.
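
To make the structure of these hashes concrete, here is a small Python sketch that mirrors the xor pairs quoted above. The pair lists come from the paper; the helper and function names are mine, purely for illustration.

    def bit(va: int, i: int) -> int:
        """Extract virtual-address bit i (bit 0 = least significant)."""
        return (va >> i) & 1

    def utag_bulldozer(va: int) -> int:
        """7-bit µtag from the Bulldozer-family pairs 12⊕21 ... 18⊕27."""
        pairs = [(12, 21), (13, 22), (14, 23), (15, 24),
                 (16, 25), (17, 26), (18, 27)]
        return sum((bit(va, a) ^ bit(va, b)) << n for n, (a, b) in enumerate(pairs))

    def utag_ryzen(va: int) -> int:
        """8-bit µtag from the Ryzen-family pairs, including 'folded'
        combinations such as 12⊕27 that mix high- and low-variability bits."""
        pairs = [(12, 27), (13, 26), (14, 25), (15, 20),
                 (16, 21), (17, 22), (18, 23), (19, 24)]
        return sum((bit(va, a) ^ bit(va, b)) << n for n, (a, b) in enumerate(pairs))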

Two aspects of these µtag hash functions are worth noting. First, by using less significant bits rather than the full 48 valid virtual address bits, all of the bits used in the hash function are available earlier because of reduced carry propagation delay (address generation involves an addition, and although high-performance adders have log(n) delay, the less significant bits are still available earlier). (This effect also means that the twelve least significant bits used to determine the cache set are available even earlier, so the predictor table can be indexed before the µtag has been computed.) Second, in the Ryzen family, the typically least variable (most significant) bits are xored with the typically most variable (least significant) bits for three bits of the hash; this should reduce the probability of false matches. False matches are handled by replacing the match rather than using the ordinary (LRU-oriented) replacement policy; this will usually result in a higher miss rate.
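
A minimal sketch of the lookup flow just described, reusing utag_ryzen from the sketch above: the set index needs only low address bits (so the predictor table can be indexed early), a µtag match selects the single way to probe, and a false match is handled by replacement. The sizes and organization here are illustrative assumptions, not AMD's actual design.

    class ToyWayPredictor:
        """Toy µtag way predictor: one stored µtag per (set, way)."""

        def __init__(self, num_sets=64, num_ways=8, line_size=64):
            self.num_sets, self.line_size = num_sets, line_size
            self.utags = [[None] * num_ways for _ in range(num_sets)]

        def set_index(self, va):
            # Uses only low address bits, available before the µtag hash.
            return (va // self.line_size) % self.num_sets

        def predict(self, va):
            s, tag = self.set_index(va), utag_ryzen(va)
            for way, stored in enumerate(self.utags[s]):
                if stored == tag:
                    return way    # read only this way's data and full tag
            return None           # no µtag match: handled as a miss

        def install(self, va, way):
            # On a fill, or on a false match, this way's µtag is simply
            # replaced, bypassing the ordinary (LRU-oriented) policy.
            self.utags[self.set_index(va)][way] = utag_ryzen(va)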

(Recent Intel x86 processors are also known to use µtag-based way prediction.)

Other Way Prediction Examples

Way prediction is not a new technique. POWER6 used a µtag predictor with the 11-bit tags being [14:17].([16:23]⊕[24:31]) for a 64 KiB 8-way cache with 128 B cache lines ("IBM POWER6 microarchitecture", H.Q. Le et al., 2007). One valid bit per hardware thread was also included to avoid thrashing for homonyms (effective address matches for different address spaces). As with Ryzen, there is clearly a recognition that the least significant bits vary more frequently, so the two least significant bits are xored with other bits.

The Pentium 4 also used a µtag predictor. According to "The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology" (Darrell Boggs et al., 2004), the 90nm implementation "significantly increased the size of the partial address match from previous implementations, thus reducing the number of false aliasing cases". Details appear not to have been published.

The MIPS R10000 used a simple MRU-based way predictor for its off-chip two-way associative L2 cache. 8 Ki single-bit prediction entries were provided to indicate the most recently used cache block of a set. If more than 8 Ki sets were provided (up to 128 Ki sets were supported for a 16 MiB L2 cache with 64 B blocks), different sets would use the same prediction bit (predictor aliasing). This way prediction was used to reduce pin count; only one tag would be read at a time and part of the data block from only one way. The alternatives would be a direct-mapped cache (HP PA-RISC used large off-chip, direct-mapped L1 caches) or specialized (more expensive) chips for handling tag comparison (MIPS R8000 used special tag SRAMs that included tag comparison logic and used the comparison result to address ordinary SRAMs holding the data).
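
In outline, such a predictor is just a one-bit-per-entry table. A toy model, with the table size and aliasing behavior taken from the description above and everything else assumed:

    class MRUWayPredictor:
        """Toy model of an R10000-style MRU predictor for a 2-way cache."""

        ENTRIES = 8 * 1024                 # 8 Ki one-bit entries, per the text

        def __init__(self):
            self.mru = [0] * self.ENTRIES  # bit names the most recently used way

        def predict(self, set_index):
            # Sets beyond 8 Ki alias onto the same entry (predictor aliasing).
            return self.mru[set_index % self.ENTRIES]

        def update(self, set_index, used_way):
            self.mru[set_index % self.ENTRIES] = used_way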

The Alpha 21264 instruction cache used a set and way predictor, which could be viewed as a variation of a branch target buffer. For each aligned chunk of four 4-byte instructions, a prediction of the next line (index) and way was included. If a chunk of instructions included a branch that was taken last time it was executed, that branch's target line and way would be the prediction for that line. Control flow instructions with variable targets (including call returns) and branches that change whether they are taken or not would result in mispredictions, but this predictor's accuracy was usually high.
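
A toy model of that behavior, with the chunk granularity and training rule following the description and the table organization assumed:

    class LineWayPredictor:
        """Toy Alpha 21264-style next line/way predictor: one entry per
        aligned chunk of four 4-byte instructions."""

        def __init__(self, num_chunks):
            # Default: fall through to the next sequential chunk in way 0.
            self.next_line = [(i + 1) % num_chunks for i in range(num_chunks)]
            self.next_way = [0] * num_chunks

        def predict(self, chunk):
            return self.next_line[chunk], self.next_way[chunk]

        def train(self, chunk, target_line, target_way):
            # A branch taken from this chunk redirects its prediction to
            # the target's line and way; if the branch later falls through,
            # the stale prediction causes a misprediction.
            self.next_line[chunk] = target_line
            self.next_way[chunk] = target_way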

Latency and Power Considerations

Modern high performance processors primarily use way prediction to reduce access energy while maintaining fast access. With support for 32-byte cache accesses (e.g., for AVX) and fast unaligned loads (which effectively doubles the access size), the energy difference between reading eight ways of data in parallel and (usually) reading only one way of data is substantial. The savings in tag read and compare energy are somewhat reduced by the need to read and compare µtags. (Note that relaxing the latency constraint on the TLB can also be exploited to reduce access energy or increase TLB capacity: hit confirmation using physical tags and permission tags can occur after the predicted way's data is already being used by the execution units.)
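
As a back-of-the-envelope illustration of why the data-array saving dominates (all widths below are rough guesses in relative units, not measured figures):

    ways      = 8
    data_bits = 2 * 32 * 8   # 32-byte access path, doubled by unaligned support
    tag_bits  = 36           # full tag width: a guess
    utag_bits = 8            # µtag width, per "Take A Way"

    conventional  = ways * (data_bits + tag_bits)              # read everything
    way_predicted = ways * utag_bits + (data_bits + tag_bits)  # µtags + one way

    print(conventional / way_predicted)   # about 7x in this toy model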

Direct-mapped caches gain a latency advantage from not having to select the correct way before forwarding the data to execution units. Selecting the correct way includes tag comparison and the multiplexor selection itself. However, if the way determination (or prediction) latency is less than the data access latency, the only added latency for set associativity is the pass-through latency of the "warmed-up" multiplexors. Since tag arrays are much smaller than data arrays, their access latency is much smaller, so it is easier (especially with virtual address tags) to determine the way a little before the data itself is available. (In earlier processors, smaller cache blocks — tag array size closer to data array size — and relatively lower wire delay compared to logic delay would make completing way determination before data availability more difficult and modestly increase the impact of selection delay.)
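
That timing condition can be sanity-checked with invented relative-unit numbers (every value below is made up for illustration):

    tag_access, tag_compare = 2, 1    # small tag array: quick read and compare
    data_access, mux_delay  = 4, 1    # larger data array; mux pass-through

    way_select_ready = tag_access + tag_compare       # way known at t = 3
    total = max(way_select_ready, data_access) + mux_delay
    print(total)  # 5: set associativity adds only the mux pass-through delay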
