VIPT Cache: Connection between TLB & Cache?

Question

I just want to clarify the concept and couldn't find detailed enough answers that could throw some light upon how everything actually works out in the hardware. Please provide any relevant details.

In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.

From the TLB we get the translated physical address. From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).

Then the translated address from the TLB is matched against the list of tags to find a candidate.

  • My question is where is this check performed?
    • In the cache?
    • If not in the cache, where else?
    • Is there a side-band connection from the TLB to the cache module to obtain the translated physical address that is to be compared against the tag addresses?

    Can somebody please throw some light on how this is "actually" generally implemented, and on the connection between the Cache module & the TLB (MMU) module?

    I know this depends on the specific architecture and implementation. But what is the implementation you know of when there is a VIPT cache?

    Thanks.

    Answer

    At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)

    The L1dTLB itself is a small/fast Content addressable memory with (for example) 64 entries and 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and 3rd) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.

    But for now, simplify your mental model and forget about hugepages. The L1dTLB is a single CAM, and checking it is a single lookup operation.
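
    To make that simplified model concrete, here is a minimal C sketch of a 64-entry, 4-way set-associative dTLB lookup. The structure names, field layout and low-VPN-bit set indexing are illustrative assumptions, not Skylake's actual design:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_WAYS   4
#define TLB_SETS   16              /* 64 entries / 4 ways */
#define PAGE_SHIFT 12              /* 4 KiB pages */

typedef struct {
    bool     valid;
    uint64_t vpn;                  /* virtual page number (real HW stores only the bits above the set index) */
    uint64_t pfn;                  /* physical frame number */
} tlb_entry;

static tlb_entry l1dtlb[TLB_SETS][TLB_WAYS];

/* Look up a virtual address; on a hit, produce the physical address. */
bool l1dtlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    unsigned set = vpn % TLB_SETS;               /* low VPN bits select the set */

    for (int way = 0; way < TLB_WAYS; way++) {   /* hardware compares all ways at once */
        tlb_entry *e = &l1dtlb[set][way];
        if (e->valid && e->vpn == vpn) {
            uint64_t page_off = vaddr & ((1ull << PAGE_SHIFT) - 1);
            *paddr = (e->pfn << PAGE_SHIFT) | page_off;
            return true;                         /* L1dTLB hit */
        }
    }
    return false;                                /* L1dTLB miss */
}
```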

    The "cache" consists of at least these parts:

    • the SRAM array that stores the tags + data in sets
    • control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
    • comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
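
    Those three pieces map onto something like the following C sketch of a set-associative lookup. The geometry and names are assumptions for illustration, using the 32kiB / 8-way / 64-byte-line L1d discussed below; real hardware runs the per-way comparators and the way-select mux in parallel, not in a loop:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64
#define NUM_WAYS  8
#define NUM_SETS  64                   /* 32 kiB / (8 ways * 64 B) */

typedef struct {
    bool     valid;
    uint64_t tag;                      /* tag taken from the *physical* address */
    uint8_t  lru;                      /* replacement state */
    uint8_t  data[LINE_SIZE];          /* the SRAM data array */
} cache_line;

static cache_line l1d[NUM_SETS][NUM_WAYS];

/* 'set' was already chosen from the untranslated index bits;
 * 'paddr' is the translated physical address from the TLB. */
bool cache_lookup(uint64_t paddr, unsigned set, uint8_t out[LINE_SIZE])
{
    uint64_t tag = paddr >> 12;                    /* bits above index+offset (6+6) */

    for (int way = 0; way < NUM_WAYS; way++) {     /* the "=" comparators, one per way */
        cache_line *line = &l1d[set][way];
        if (line->valid && line->tag == tag) {
            memcpy(out, line->data, LINE_SIZE);    /* way-select mux picks this way's data */
            line->lru = 0;                         /* mark this way Most Recently Used */
            return true;                           /* hit */
        }
    }
    return false;                                  /* miss: trigger miss handling */
}
```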

    The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:

    • AGU generates an address from register(s) + offset.

    (Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing mode: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)
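
    As a rough illustration of the condition that shortcut gambles on (a hedged sketch only; the hardware simply speculates and re-checks the high bits, it doesn't evaluate anything like C):

```c
#include <stdbool.h>
#include <stdint.h>

/* The fast path is only correct if base and base+disp fall in the same
 * 4 KiB page, so the TLB/index work started from the base register alone
 * still applies to the full address. */
bool same_4k_page(uint64_t base, uint64_t disp)
{
    return (base >> 12) == ((base + disp) >> 12);
}
```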

    The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical; translation is a no-op for those bits. This VIPT speed hack with the non-aliasing of a PIPT cache works as long as L1_size / associativity <= page_size, e.g. 32kiB / 8-way = 4k pages.
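
    Spelling out that arithmetic for the 32kiB / 8-way / 64-byte-line example (a sketch of the constraint only, not of any particular CPU's exact bit assignment):

```c
#include <assert.h>

enum {
    LINE_BYTES  = 64,
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    PAGE_BYTES  = 4096,

    SETS        = CACHE_BYTES / (WAYS * LINE_BYTES),  /* 64 sets        */
    OFFSET_BITS = 6,                                   /* log2(64)       */
    INDEX_BITS  = 6,                                   /* log2(64 sets)  */
    PAGE_BITS   = 12                                   /* log2(4096)     */
};

int main(void)
{
    /* All index + line-offset bits sit inside the page offset, so they are
     * identical in the virtual and physical address: the set can be selected
     * while the TLB is still translating the high bits. */
    assert(OFFSET_BITS + INDEX_BITS <= PAGE_BITS);

    /* Equivalently: cache_size / associativity <= page_size. */
    assert(CACHE_BYTES / WAYS <= PAGE_BYTES);
    return 0;
}
```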

    The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1. Higher-associativity (more ways per set) L3 caches definitely not)

    The high bits of the address are looked up in the L1dTLB CAM array.

    The tag comparator receives the translated physical-address tag and the fetched tags from that set.
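
    Pulling the steps so far together, a software model of the load path might look like this (hypothetical helpers echoing the sketches above; in hardware the set/tag fetch and the L1dTLB lookup overlap in time rather than running one after the other):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, as sketched earlier. */
bool l1dtlb_lookup(uint64_t vaddr, uint64_t *paddr);
bool cache_lookup(uint64_t paddr, unsigned set, uint8_t out[64]);

#define OFFSET_BITS 6    /* 64-byte lines */
#define NUM_SETS    64   /* 32 kiB / 8-way / 64 B */

bool l1d_load(uint64_t vaddr, uint8_t line_out[64])
{
    /* Index bits sit below the page offset: no translation needed. */
    unsigned set = (vaddr >> OFFSET_BITS) & (NUM_SETS - 1);

    uint64_t paddr;
    if (!l1dtlb_lookup(vaddr, &paddr))        /* in parallel with the set/tag fetch */
        return false;                         /* TLB miss: stall/replay, see below */

    /* The cache's tag comparators check the physical tag against each way. */
    return cache_lookup(paddr, set, line_out);
}
```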

    If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).

    Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.

    But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.
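
    A hedged software model of that extraction stage (it assumes a little-endian machine and a load that doesn't cross the 64-byte line; the name and signature are illustrative, not any real microarchitecture's interface):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Given the 64-byte line from the selected way, pick out the load result
 * using the offset-within-line and the operand size, then zero- or
 * sign-extend it into a 64-bit register value. */
uint64_t extract_load(const uint8_t line[64], unsigned offset,
                      unsigned size, bool sign_extend)
{
    uint64_t value = 0;
    memcpy(&value, line + offset, size);   /* little-endian: low bytes land in the low end */

    if (sign_extend && size < 8) {
        unsigned shift = 64 - 8 * size;
        value = (uint64_t)((int64_t)(value << shift) >> shift);
    }
    return value;                          /* zero-extended if sign_extend is false */
}
```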

    AVX512 CPUs can even do 64-byte full-line loads.

    If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if / how CPUs manage to pipeline this so other loads can keep executing while the TLB-miss is resolved. That process involves checking the L2TLB (Skylake: unified 1536 entry 12-way for 4k and 2M, 16-entry for 1G), and if that fails then with a page-walk.
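
    A hedged sketch of that ordering (hypothetical helper names; TLB refill and the fault path on a failed walk are omitted):

```c
#include <stdbool.h>
#include <stdint.h>

bool l1dtlb_lookup(uint64_t vaddr, uint64_t *paddr);   /* as sketched above                        */
bool l2tlb_lookup(uint64_t vaddr, uint64_t *paddr);    /* unified 2nd-level TLB                    */
uint64_t page_walk(uint64_t vaddr);                    /* hardware page-table walk (loads via L1D) */

uint64_t translate(uint64_t vaddr)
{
    uint64_t paddr;
    if (l1dtlb_lookup(vaddr, &paddr))
        return paddr;            /* common case: L1dTLB hit */
    if (l2tlb_lookup(vaddr, &paddr))
        return paddr;            /* L1dTLB miss, L2 TLB hit */
    return page_walk(vaddr);     /* both miss: walk the page tables */
}
```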

    I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.

    At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.

    Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).

    is there a side-band connection from TLB to the Cache?

    I wouldn't call it a side-band connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, L1iTLB is used only by the L1I cache.

    If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.

    Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.
