VIPT Cache: Connection between TLB & Cache?

Question

I just want to clarify the concept and could not find detailed enough answers which can throw some light upon how everything actually works out in the hardware. Please provide any relevant details.

In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.

From the TLB we get the translated physical address. From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).

Then the translated TLB address is matched with the list of tags to find a candidate.
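To make that flow concrete, here is a minimal C sketch of the address split (not of real hardware), assuming the geometry used later in the answer: 4 KiB pages, 64-byte lines and 64 sets (i.e. a 32 KiB, 8-way L1d). The set index needs only bits below the page offset, so indexing can start from the virtual address before the TLB answers; only the tag compare needs the physical page number from the TLB.

```c
#include <stdint.h>

/* Illustrative address split only (assumed geometry: 4 KiB pages,
 * 64-byte lines, 64 sets => 32 KiB, 8-way L1d). */
static inline uint32_t line_offset(uint64_t va)   { return (uint32_t)(va & 0x3F); }        /* bits [5:0]  */
static inline uint32_t set_index(uint64_t va)     { return (uint32_t)((va >> 6) & 0x3F); } /* bits [11:6] */
static inline uint64_t virt_page_num(uint64_t va) { return va >> 12; }                     /* sent to the TLB */

/* The set index lies entirely within the page offset (bits [11:0]), so the
 * cache can read the indexed set while the TLB translates virt_page_num(va);
 * the tag comparison then uses the physical page number the TLB returns. */
```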

• My question is: where is this check performed?
  • In the cache?
  • If not in the cache, where else?
  • Is there a side-band connection from the TLB to the cache module to get the translated physical address needed for comparison with the tag addresses?

Can somebody please throw some light on how this is "actually" generally implemented, and on the connection between the Cache module & the TLB (MMU) module?

I know this depends on the specific architecture and implementation. But what implementations do you know of when there is a VIPT cache?

Thanks.

Answer

At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? https://stackoverflow.com/a/38549736)

The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.

But for now, simplify your mental model and forget about hugepages. The L1dTLB is a single CAM, and checking it is a single lookup operation.
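As a rough software model (not hardware), a 64-entry, 4-way L1dTLB can be pictured as 16 sets of 4 entries; the loop below only stands in for the four comparisons the CAM does in parallel, and the hugepage arrays are ignored as suggested above.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_SETS 16u   /* 64 entries / 4 ways */
#define TLB_WAYS 4u

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* virtual page number (the tag) */
    uint64_t ppn;   /* physical page number */
};

static struct tlb_entry l1dtlb[TLB_SETS][TLB_WAYS];

/* Returns true on a hit and writes the physical page number to *ppn_out.
 * Real hardware checks all ways of the indexed set in the same cycle. */
bool l1dtlb_lookup(uint64_t vaddr, uint64_t *ppn_out)
{
    uint64_t vpn = vaddr >> 12;                 /* assume 4 KiB base pages */
    uint32_t set = (uint32_t)(vpn % TLB_SETS);
    for (unsigned w = 0; w < TLB_WAYS; w++) {
        if (l1dtlb[set][w].valid && l1dtlb[set][w].vpn == vpn) {
            *ppn_out = l1dtlb[set][w].ppn;
            return true;
        }
    }
    return false;                               /* miss: L2TLB / page walk */
}
```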

      缓存" 至少包含以下部分:

• the SRAM array that stores the tags + data in sets
• control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
• comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used)
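Here is the sketch referred to above: a toy C model of one set of such a cache. The 8-way, 64-byte-line geometry comes from the answer's example; the per-way age counters are an assumption standing in for whatever replacement state real L1d caches keep, and the loop models the per-way comparators that hardware evaluates at once.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define WAYS 8u
#define LINE 64u

struct cache_set {
    bool     valid[WAYS];
    uint64_t tag[WAYS];        /* from the SRAM tag array */
    uint8_t  data[WAYS][LINE]; /* from the SRAM data array */
    uint8_t  age[WAYS];        /* replacement state; 0 = most recently used */
};

/* Comparators + way select: check the translated (physical) tag, copy out the
 * line on a hit and mark that way most recently used.  A miss would hand off
 * to miss handling instead. */
bool set_read(struct cache_set *s, uint64_t phys_tag, uint8_t out[LINE])
{
    for (unsigned w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == phys_tag) {
            memcpy(out, s->data[w], LINE);
            for (unsigned i = 0; i < WAYS; i++)   /* age the ways that were newer */
                if (s->age[i] < s->age[w])
                    s->age[i]++;
            s->age[w] = 0;                        /* this way is now MRU */
            return true;
        }
    }
    return false;
}
```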

The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:

• AGU generates an address from register(s) + offset.

(Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing modes: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. See: Is there a penalty when base+offset is in a different page than the base? https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base)
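The condition that shortcut gambles on can be written as a one-line check; this is only an illustration of the predicate, not of how the hardware evaluates it.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when base and base+disp fall in the same 4 KiB page, i.e. when the
 * speculative early use of the base register's translation turns out fine. */
static inline bool same_4k_page(uint64_t base, uint64_t disp /* 0..2047 */)
{
    return (base >> 12) == ((base + disp) >> 12);
}
```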

The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical; for those bits, translation is a no-op. This gives VIPT speed with the non-aliasing behaviour of a PIPT cache, and it works as long as L1_size / associativity <= page_size, e.g. 32 kiB / 8-way = 4k pages.
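That constraint can be spelled out with the example numbers; the static assertion below just re-checks the arithmetic (32 KiB / 8 ways = 4 KiB per way, so index + line-offset bits fit inside the 12 page-offset bits).

```c
#include <assert.h>

#define L1_SIZE   (32u * 1024u)  /* 32 KiB */
#define L1_ASSOC  8u             /* 8 ways */
#define PAGE_SIZE 4096u          /* 4 KiB pages */

/* VIPT-without-aliasing condition from the text: each way is no larger than
 * a page, so the set index never uses bits above the page offset. */
static_assert(L1_SIZE / L1_ASSOC <= PAGE_SIZE,
              "index bits must stay within the page offset");
```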

The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1. Higher-associativity (more ways per set) L3 caches definitely not)

If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).

Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.

But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
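As a software stand-in for that mux network (purely illustrative: the function name and little-endian byte handling are assumptions, and line splits and broadcast loads are left out), selecting and extending the operand once the hitting way's 64 bytes are available could look like this.

```c
#include <stdint.h>
#include <string.h>

/* Pick `size` bytes at `offset` from the 64-byte line, then zero- or
 * sign-extend to 64 bits.  Assumes a little-endian host and offset + size <= 64. */
static uint64_t extract_operand(const uint8_t line[64], unsigned offset,
                                unsigned size /* 1, 2, 4 or 8 */, int sign_extend)
{
    uint64_t v = 0;
    memcpy(&v, line + offset, size);
    if (sign_extend && size < 8) {
        uint64_t sign_bit = 1ull << (size * 8u - 1u);
        if (v & sign_bit)
            v |= ~((1ull << (size * 8u)) - 1u);   /* replicate the sign bit upward */
    }
    return v;
}
```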

AVX512 CPUs can even do 64-byte full-line loads.

If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
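The fallback order described there can be summarised like this; the function names are stand-ins (only the ordering L1dTLB, then unified L2TLB, then page walk comes from the text), and l1dtlb_lookup matches the earlier sketch.

```c
#include <stdint.h>
#include <stdbool.h>

bool l1dtlb_lookup(uint64_t vaddr, uint64_t *ppn);  /* as sketched earlier */
bool l2tlb_lookup(uint64_t vaddr, uint64_t *ppn);   /* unified second-level TLB */
bool page_walk(uint64_t vaddr, uint64_t *ppn);      /* reads page tables (through L1d) */

/* Returns false only if the page walk fails too, i.e. a page fault would be raised. */
bool translate(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t ppn;
    if (!l1dtlb_lookup(vaddr, &ppn) &&
        !l2tlb_lookup(vaddr, &ppn) &&
        !page_walk(vaddr, &ppn))
        return false;
    *paddr = (ppn << 12) | (vaddr & 0xFFF);         /* assume 4 KiB pages */
    return true;
}
```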

I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.

At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.

Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).

      is there a side-band connection from TLB to the Cache?

I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.

If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.

Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.
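So an outer-level lookup needs nothing from the virtual address. A short sketch with a made-up geometry (1 MiB, 16-way, 64-byte lines, hence 1024 sets; none of these numbers come from the text) indexes and tags purely by the physical address produced during the L1 check.

```c
#include <stdint.h>

/* Hypothetical PIPT L2 geometry: 1 MiB / (16 ways * 64 B) = 1024 sets. */
static inline uint32_t l2_set_index(uint64_t paddr)
{
    return (uint32_t)((paddr >> 6) & 0x3FF);   /* physical bits [15:6] */
}

static inline uint64_t l2_tag(uint64_t paddr)
{
    return paddr >> 16;                        /* bits above index + line offset */
}
```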
