How are barriers/fences and acquire, release semantics implemented microarchitecturally?

Question

A lot of questions on SO, and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf and Preshing's articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ (and his entire series), talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is: how are these barriers and memory-ordering semantics implemented microarchitecturally on x86 and ARM?

For store-store barriers, it seems like on x86 the store buffer maintains the program order of stores and commits them to L1d in that order (hence making them globally visible in the same order). If a store buffer is not ordered, i.e. does not maintain stores in program order, how is a store-store barrier implemented? Is it just "marking" the store buffer in such a way that stores before the barrier commit to the cache-coherent domain before stores after it? Or does the memory barrier actually flush the store buffer and stall all instructions until the flushing is complete? Could it be implemented both ways?

For load-load barriers, how is load-load reordering prevented? It is hard to believe that x86 executes all loads in order! I assume loads can execute out of order but commit/retire in order. If so, if a CPU executes 2 loads to 2 different locations, how does one load ensure that it got a value from, say, time T100 and the next one got its value at or after T100? What if the first load misses in the cache and is waiting for data while the second load hits and gets its value? When load 1 gets its value, how does it ensure that the value it got is not from a store newer than load 2's value? If loads can execute out of order, how are violations of memory ordering detected?

Similarly, how are load-store barriers (implicit in all loads on x86) implemented, and how are store-load barriers (such as mfence) implemented? i.e. what do the dmb ld/st and plain dmb instructions do microarchitecturally on ARM, and what do every load, every store, and the mfence instruction do microarchitecturally on x86 to ensure memory ordering?

Answer

Much of this has been covered in other Q&As, but I'll give a summary here. (And look for links to add.) Still, good question; it's useful to collect this all in one place.

On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".

Weakly-ordered ISAs don't have to speculate, they can just load in any order.

x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).

Avoiding LoadStore is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads: let them retire once they're known to be non-faulting, but before the data actually arrives.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)

How is load->store reordering possible with in-order commit? goes into this more deeply, for in-order vs. out-of-order.

The only barrier instruction that does anything for regular stores is mfence which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. Are loads and stores the only instructions that gets reordered? covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

相关:

  • How many memory barriers instructions does an x86 CPU have?
  • How can I experience "LFENCE or SFENCE can not pass earlier read/write"
  • Does lock xchg have the same behavior as mfence?
  • Does the Intel Memory Model make SFENCE and LFENCE redundant?
  • Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
  • When should I use _mm_sfence _mm_lfence and _mm_mfence: high-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven't used any NT stores just makes your code slower for no reason, compared to atomic_thread_fence(mo_release).
