How does the VIPT to PIPT conversion work on L1->L2 eviction

Question

This scenario came into my head and it seems a bit basic but I'll ask.

So there is a virtual index and physical tag in L1 but the set becomes full so it is evicted. How does the L1 controller get the full physical address from the virtual index and the physical tag in L1 so the line can be inserted into L2? I suppose it could search the TLB for the combination but that seems slow and also it may not be in the TLB at all. Perhaps the full physical address from the original TLB translation is stored in the L1 next to the cache line?

This also opens the wider question of how the PMH invalidates the L1 entry when it writes accessed bits to the PTEs and PDEs and so on. It is my understanding it interfaces with the L2 cache directly for physical addresses but when it writes accessed and modified bits, as well as sending an RFO if it needs to, it would have to reflect the change in the copy in the L1 if there is one, meaning it would have to know the virtual index of the physical address. In this case if the full physical address were also stored in the L1 then it offers a way for the L2 to be able to index it as well.

Answer

Yes, outer caches are (almost?) always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out into the memory hierarchy.

In Intel CPUs, the VIPT L1 caches have all the index bits from the offset-within-page part of the address, so virt=phys, avoiding any aliasing problems. It's basically PIPT, but still able to fetch data/tags from the set in parallel with the TLB lookup of the page-number bits to create an input for the tag comparator.
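
To make that concrete, here is a minimal C sketch of the bit arithmetic, assuming the common 32 KiB / 8-way / 64-byte-line L1d geometry and 4 KiB pages (illustrative only, not any CPU's actual implementation): because the set index uses only page-offset bits, indexing with the virtual address picks exactly the set the physical address would.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not a real CPU's implementation.
 * Assumed geometry: 32 KiB, 8-way, 64-byte lines -> 64 sets,
 * so the set index is addr[11:6], entirely inside the 4 KiB page offset. */
enum { LINE_BITS = 6, SET_BITS = 6, PAGE_BITS = 12 };

static uint32_t set_index(uint64_t addr) {
    return (uint32_t)((addr >> LINE_BITS) & ((1u << SET_BITS) - 1));
}

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;   /* some virtual address            */
    uint64_t paddr = 0x000000019e237abcULL;   /* same page offset, different PFN */

    /* The index bits never reach above the page offset ...            */
    assert(LINE_BITS + SET_BITS <= PAGE_BITS);
    /* ... so the "virtual" index equals the physical one: the set can be
     * selected before the TLB answers, yet the cache behaves like PIPT. */
    assert(set_index(vaddr) == set_index(paddr));
    return 0;
}
```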

The full physical address is known just from L1d index + tag, again because it behaves like a PIPT for everything except load latency.
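
As a sketch of what that buys you on eviction, assuming the stored tag holds every physical-address bit above the 4 KiB page offset (the helper name and bit widths below are hypothetical), the controller can rebuild the line's physical address from nothing more than the tag and the set number:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of what an eviction can do when the tag stores every physical
 * bit above the page offset (names and widths are illustrative). */
enum { LINE_BITS = 6, PAGE_BITS = 12 };

/* Rebuild the line's physical address from what the L1d set holds. */
static uint64_t evicted_line_paddr(uint64_t stored_tag, uint32_t set) {
    return (stored_tag << PAGE_BITS)          /* physical page-frame number     */
         | ((uint64_t)set << LINE_BITS);      /* index bits == page-offset bits */
}

int main(void) {
    /* tag 0x19e237, set 0x2a -> physical line address 0x19e237a80 */
    printf("0x%llx\n", (unsigned long long)evicted_line_paddr(0x19e237, 0x2a));
    return 0;
}
```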

In the general case of virtually-indexed caches where some of the index bits do come from the page-number, that's a good question. Such systems do exist, and page-colouring is often used by the OS to avoid aliasing. (So they don't need to flush the cache on context switches.)

The question Virtually indexed physically tagged cache Synonym has a diagram for one such VIPT L1d: the physical tag is extended a few bits to come all the way down to the page offset, overlapping the top index bit.

Good observation that a write-back cache needs to be able to evict dirty lines long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere.

Having the tag include all the physical address bits above the page offset (even if that overlaps some index bits) solves this problem.
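
A hedged sketch of that overlapping-tag layout, using made-up geometry (a hypothetical 32 KiB direct-mapped VIPT cache with 64-byte lines, so the 9 index bits reach up to bit 14 while pages are 4 KiB): since the tag stores everything from bit 12 upward, the physical line address is still recoverable even though the top three index bits were virtual.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the overlapping-tag idea with made-up geometry:
 * 32 KiB direct-mapped, 64-byte lines -> 9 index bits, addr[14:6].
 * Index bits [14:12] come from the virtual page number, but if the tag
 * stores every physical bit from bit 12 upward (overlapping those index
 * bits), the physical line address is fully recoverable at eviction. */
enum { LINE_BITS = 6, INDEX_BITS = 9, PAGE_BITS = 12 };

static uint64_t evicted_line_paddr(uint64_t tag_from_bit12, uint32_t set) {
    /* Trust only the index bits below the page boundary (phys == virt);
     * the overlapped bits [14:12] are taken from the tag instead. */
    uint64_t offset_part = ((uint64_t)set << LINE_BITS) & ((1u << PAGE_BITS) - 1);
    return (tag_from_bit12 << PAGE_BITS) | offset_part;
}

int main(void) {
    /* Suppose the line's physical address is 0x19e237a80.
     * tag (phys bits 12 and up)   = 0x19e237  (low 3 bits 0b111 = phys addr[14:12])
     * set index (virt addr[14:6]) = 0x12a     (top 3 bits 0b100 = virt addr[14:12])
     * The virtual bits above the page offset are discarded; the tag
     * supplies the real addr[14:12]. */
    assert(evicted_line_paddr(0x19e237, 0x12a) == 0x19e237a80ULL);
    return 0;
}
```

The only cost of this layout is a few extra tag bits per line, which is what makes write-back eviction possible without any reverse translation.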

Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, being virtual isn't a problem.

But I don't think a TLB check at eviction could solve the problem for the non-overlapping tag case: you don't have the full virtual address anymore, only the low bits of your page-number are virtual (from the index), the rest are physical (from the tag). So this isn't a valid input to the TLB.

So besides being inefficient, there's also the equally important problem that it wouldn't work at all. :P Maybe there's some trick I don't know or something I'm missing, but I don't think even a special TLB indexed both ways (phys->virt and virt->phys) could work, because multiple mappings of the same physical page are allowed.

I think real CPUs that have used VIVT caches have normally had them as write-through. I don't know the history well enough to say for sure or cite any examples. I don't see how they could be write-back, unless they stored two tags (physical and virtual) for every line.

I think early RISC CPUs often had 8k direct-mapped caches.

But first-gen classic 5-stage MIPS R2000 (using external SRAM for its L1) apparently had a PIPT write-back cache, if the diagram in these slides labeled MIPS R2000 is right, showing a 14-bit cache index taking some bits from the physical page number of the TLB result. But it still works with 2 cycle latency for loads (1 for address-generation in the EX stage, 1 for cache access in the MEM stage).

Clock speeds were much lower in those days, and caches+TLBs have gotten larger. I guess back then a 32-bit binary adder in the ALU did have comparable latency to TLB + cache access, maybe not using as aggressive carry-lookahead or carry-select designs.

A MIPS 4300i datasheet, (variant of MIPS 4200 used in Nintendo 64) shows what happens where/when in its 5-stage pipeline, with some things happening on the rising or falling edge of the clock, letting it divide some things up into half-clocks within a stage. (so e.g. forwarding can work from the first half of one stage to the 2nd half of another, e.g. for branch target -> instruction fetch, still without needing extra latching between half-stages.)

Anyway, it shows DVA (data virtual address) calculation happening in EX: that's the register + imm16 from a lw $t0, 1234($t1). Then DTLB and DCR (data-cache read) happen in parallel in the first half of the Data Cache stage. (So this is a VIPT). DTC (Data Tag Check) and LA (load alignment e.g. shifting for LWL / LWR, or for LBU to extract a byte from a fetched word) happen in parallel in the 2nd half of the stage.

So I still haven't found confirmation of a single-cycle (after address calculation) PIPT MIPS. But this is definite confirmation that single-cycle VIPT was a thing. From Wikipedia, we know that its D-cache was 8-kiB direct-mapped write-back.
