How do weak ISAs resolve WAW memory hazards using the store buffer?

Published: 2021/4/24 21:06:21 · Tag: cpu-architecture

Question

Modern CPUs use a store buffer to delay commit to cache until after retirement, which also avoids WAR and WAW memory hazards. I'm wondering how weak ISAs, whose store buffer is not a FIFO and therefore allows StoreStore reordering, still resolve WAW hazards with it. Do they insert an implicit barrier?

More specifically, if two stores to the same memory address retire in-order on a weak ISA, e.g. ARM/POWER, they could theoretically commit to cache out-of-order, since the store buffer is not FIFO, thus breaking the WAW dependency.

According to Wikipedia:

...the store instructions, including the memory address and store data, are buffered in a store queue until they reach the retirement point. When a store retires, it then writes its value to the memory system. This avoids the WAR and WAW dependence problems...

Answer

My guess; I'm not familiar with the details of any real-world designs.

Even if the store buffer is a full scheduler that can "grab" any graduated store for commit to L1d, I'd assume it would use an oldest-ready-first order. (Like an instruction / uop scheduler, aka the RS or reservation station.)

"Ready" would mean the cache line is exclusively owned (Modified or Exclusive state). Every graduated store itself is implicitly ready to commit because by definition the associated store instruction has retired.

In-order retirement means that stores become eligible for commit in program-order, so you can't have an older store that's temporarily hidden from the oldest-ready-first scheduling. Together, those things would ensure that for any given byte, stores overlapping it are in program order and thus maintain consistency of global-visibility order and final value for any given group of bytes within a cache line.
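
The oldest-ready-first idea above can be sketched as a toy model. This is only an illustration of the guess in this answer, not how any real CPU is known to work: the `Store` class, `commit_oldest_ready_first`, the 64-byte line size, and the rule that a line stays owned once acquired are all assumptions made for the example, and every store is assumed to fit inside one cache line.

```python
from dataclasses import dataclass

LINE = 64  # assumed cache-line size in bytes


@dataclass
class Store:
    seq: int     # program order (lower = older); hypothetical field
    addr: int    # byte address of the first byte written
    data: bytes  # bytes to write (assumed not to cross a line boundary)


def line_of(addr):
    return addr // LINE


def commit_oldest_ready_first(stores, ownership_order):
    """Commit graduated stores as their cache lines become owned.

    ownership_order lists line numbers in the order this core gains
    exclusive ownership of them. Returns (memory, commit_order).
    """
    pending = sorted(stores, key=lambda s: s.seq)
    owned, memory, order = set(), {}, []
    for line in ownership_order:
        owned.add(line)
        while True:
            ready = [s for s in pending if line_of(s.addr) in owned]
            if not ready:
                break
            s = ready[0]  # pending is sorted by age, so this is the oldest ready store
            for i, b in enumerate(s.data):
                memory[s.addr + i] = b
            order.append(s.seq)
            pending.remove(s)
    return memory, order


# Two stores to 0x100 with an unrelated store to 0x40 between them.
stores = [Store(0, 0x100, b"\x01"), Store(1, 0x40, b"\x02"), Store(2, 0x100, b"\x03")]
# The line holding 0x40 becomes owned before the line holding 0x100.
memory, order = commit_oldest_ready_first(stores, [line_of(0x40), line_of(0x100)])
```

In this run the store to 0x40 commits ahead of the older store to 0x100 (StoreStore reordering across lines), while the two stores to 0x100 still commit in program order, so the final value at 0x100 comes from the younger one.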

A memory barrier might work by fencing off the store buffer like a divider on a grocery-store checkout conveyor belt, preventing grabbing of stores past it while committing ones to the same place in the same line.
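
The conveyor-belt divider can be sketched the same way. Again this is a toy model under stated assumptions: `BARRIER`, `eligible`, and `drain` are hypothetical names, entries are plain `(addr, value)` tuples in program order, and a line stays owned once acquired.

```python
BARRIER = "barrier"  # hypothetical marker entry sitting in the store buffer


def eligible(entries, owned_lines, line_size=64):
    """Indices of entries the commit logic may pick right now.

    entries is in program order; each item is BARRIER or an (addr, value) tuple.
    """
    picks = []
    for i, e in enumerate(entries):
        if e == BARRIER:
            break  # nothing past the divider may commit yet
        addr, _ = e
        if addr // line_size in owned_lines:
            picks.append(i)
    return picks


def drain(entries, ownership_order, line_size=64):
    """Return the stores in the order they commit."""
    entries = list(entries)
    owned, order = set(), []
    for line in ownership_order:
        owned.add(line)
        progress = True
        while progress:
            progress = False
            # A barrier at the head has no older stores left, so drop it.
            while entries and entries[0] == BARRIER:
                entries.pop(0)
            picks = eligible(entries, owned, line_size)
            if picks:
                order.append(entries.pop(picks[0]))  # oldest eligible store
                progress = True
    return order


# The line holding 0x40 (line 1) is owned before the line holding 0x100 (line 4).
with_barrier = drain([(0x100, 1), BARRIER, (0x40, 2)], [1, 4])
without_barrier = drain([(0x100, 1), (0x40, 2)], [1, 4])
```

With the barrier in place the younger store to 0x40 has to wait for the older store to 0x100; without it, the younger store commits first because its line was ready earlier.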

We do know that real-world weakly-ordered store buffers, such as those in PowerPC RS64-III (in-order exec) and Alpha 21264 (OoO exec), do merging to help them create whole 4-byte or 8-byte aligned commits to L1d, e.g. out of multiple byte stores. That's also fine, assuming your merge algorithm respects order for any given byte, e.g. by putting the data from a younger store into an older SB entry (or vice versa) and marking the other entry as "already committed". Obviously this must respect store barriers.
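
A merge that respects per-byte order could look roughly like the sketch below. It is not taken from any documented design: `merge_into_older` and `coalesce` are hypothetical helpers, entries are modeled as `{byte_address: value}` dicts, only program-order-adjacent stores to the same aligned 8-byte word are merged, stores are assumed not to span words, and barriers are left out entirely.

```python
def merge_into_older(older, younger):
    """Merge two entries that target the same aligned word.

    Each entry is a dict {byte_address: byte_value}. The younger store wins
    on any byte both entries write, which preserves per-byte program order.
    """
    merged = dict(older)
    merged.update(younger)  # younger data overwrites overlapping bytes
    return merged


def coalesce(stores, word_size=8):
    """Coalesce adjacent same-word stores (given in program order).

    stores is a list of (addr, data) pairs. Returns a list of
    (word_index, {byte_address: byte_value}) entries to commit.
    """
    out = []
    for addr, data in stores:
        entry = {addr + i: b for i, b in enumerate(data)}
        word = addr // word_size
        if out and out[-1][0] == word:
            out[-1] = (word, merge_into_older(out[-1][1], entry))
        else:
            out.append((word, entry))
    return out


# Four byte stores to one word (0x11 is written twice), then one to the next word.
stores = [(0x10, b"\xaa"), (0x11, b"\xbb"), (0x11, b"\xcc"), (0x12, b"\xdd"), (0x18, b"\xee")]
merged = coalesce(stores)
```

The four stores to the first word collapse into a single entry, and byte 0x11 keeps the value from the younger of its two writers.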

I think this is all fine even with unaligned stores, although preserving atomicity guarantees for unaligned stores could be tricky with merging. (Intel P6-family and later does provide atomicity guarantees for unaligned cached stores that don't cross a cache-line boundary, but we don't think Intel does merging in the store buffer proper; maybe just some stuff with LFBs for cache-miss back-to-back stores to the same line.)

Real hardware is probably not a full scheduler that can merge any 2 SB entries; e.g. it might merge only over a limited range, to reduce the number of different addresses (and sizes) to compare at once. Also, you'd probably still only free up SB entries in program order, so it can basically be a circular buffer (unlike the RS). Allocating in program order, and having the order tracked by the layout of the SB itself, makes it much cheaper for memory barriers to work, and to track where the youngest "graduated" store is.
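
As a rough illustration of the circular-buffer layout, here is a minimal sketch. The `StoreBuffer` class and all of its fields and methods are hypothetical; it assumes entries are allocated at the tail in program order, retirement advances a single "graduated" boundary, entries may be marked committed in any order, and slots are only released from the head.

```python
class StoreBuffer:
    """Toy circular store buffer; names and fields are illustrative only."""

    def __init__(self, size):
        self.slots = [None] * size  # None = free slot
        self.head = 0               # oldest allocated entry
        self.tail = 0               # next slot to allocate
        self.count = 0              # number of allocated entries
        self.graduated = 0          # entries counted from head that have retired

    def alloc(self, store):
        """Allocate in program order; return the slot index, or None if full."""
        if self.count == len(self.slots):
            return None
        idx = self.tail
        self.slots[idx] = {"store": store, "committed": False}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def retire_oldest(self):
        """In-order retirement: the graduated boundary moves by one entry."""
        if self.graduated < self.count:
            self.graduated += 1

    def is_graduated(self, idx):
        # Position relative to head tells us the age of the entry.
        return (idx - self.head) % len(self.slots) < self.graduated

    def mark_committed(self, idx):
        if not self.is_graduated(idx):
            raise ValueError("only graduated stores may commit")
        self.slots[idx]["committed"] = True
        # Free slots only from the head, i.e. in program order.
        while self.count and self.slots[self.head]["committed"]:
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
            self.graduated -= 1


sb = StoreBuffer(4)
a, b, c = sb.alloc("A"), sb.alloc("B"), sb.alloc("C")
sb.retire_oldest()
sb.retire_oldest()      # A and B are graduated, C is still speculative
sb.mark_committed(b)    # B commits first; its slot cannot be freed yet
count_after_b = sb.count
sb.mark_committed(a)    # now both A and B are freed from the head
```

Because age is implied by position, finding the youngest graduated store or the entries older than a barrier is just pointer arithmetic rather than a search.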

Disclaimer: IDK if this is exactly how real HW works

Possible corner case: unaligned 4-byte store to [cache_line+63] (split across a CL boundary) and then to [cache_line+60] (fully contained in the lower cache line). If the older store-buffer entry can't commit right away because we don't yet own the next cache line, but we do own cache_line, we still can't let the younger store to cache_line+60 commit first, if we're depending on that not happening to avoid WAW hazards.

So you'd probably want a line-split SB entry to be able to commit the data to one line but not the other, allowing oldest-ready-first to happen for each location separately, not tying together order across 2 cache lines.
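
The corner case can be played through with a small model in which a line-split entry is tracked as one piece per cache line. This is a sketch under assumptions, not a description of real hardware: `split_by_line` and `commit_per_line` are hypothetical, the line size is taken to be 64 bytes, and all pending pieces for a line commit oldest-first as soon as that line is owned.

```python
LINE = 64  # assumed cache-line size in bytes


def split_by_line(seq, addr, data):
    """Split one store into per-line pieces: (seq, line, {byte_address: value})."""
    pieces = {}
    for i, b in enumerate(data):
        a = addr + i
        pieces.setdefault(a // LINE, {})[a] = b
    return [(seq, line, chunk) for line, chunk in sorted(pieces.items())]


def commit_per_line(stores, ownership_order):
    """stores: list of (addr, data) in program order.

    Returns (memory, log), where log records (seq, line) for each committed piece.
    """
    pending = []
    for seq, (addr, data) in enumerate(stores):
        pending.extend(split_by_line(seq, addr, data))
    memory, log = {}, []
    for line in ownership_order:
        # pending is in program order, so pieces for this line commit oldest-first.
        for seq, piece_line, chunk in pending:
            if piece_line == line:
                memory.update(chunk)
                log.append((seq, line))
        pending = [p for p in pending if p[1] != line]
    return memory, log


base = 0x1000  # start of an arbitrary cache line
stores = [
    (base + 63, b"\x11\x12\x13\x14"),  # older store, split across two lines
    (base + 60, b"\x21\x22\x23\x24"),  # younger store, fully in the lower line
]
# The lower line is owned first; the next line arrives later.
memory, log = commit_per_line(stores, [base // LINE, base // LINE + 1])
```

The older store's byte at offset 63 commits to the lower line before the younger store overwrites it, so the WAW order on that byte holds even though the rest of the older store waits for the next line. Committing the two halves separately does mean the split store is not atomic in this model.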

Related: I wrote my own answer explaining what a store buffer is. I tried to avoid mistakes like the one Wikipedia makes ("when a store retires, it writes its value to the memory system"). In fact, retirement just makes it eligible to commit; such stores are called "graduated" stores.
