Does memory fencing block threads in multi-core CPUs?


Problem description

I was reading the Intel instruction set guide 64-ia-32 to get an idea of memory fences. My question is: for an example with SFENCE, in order to make sure that all store operations are globally visible, does the multi-core CPU park all the threads, even those running on other cores, until cache coherence is achieved?

Solution

Barriers don't make other threads/cores wait. They make some operations in the current thread wait, depending on what kind of barrier it is. Out-of-order execution of non-memory instructions isn't necessarily blocked.

Barriers don't even make your loads/stores visible to other threads any faster; CPU cores already commit retired stores from the store buffer to L1d cache as fast as possible, after all the necessary MESI coherency rules have been followed. (x86's strong memory model only allows stores to commit in program order, even without barriers.)

Barriers don't necessarily order instruction execution; they order global visibility, i.e. what comes out the far end of the store buffer.


mfence (or a locked operation like lock add or xchg [mem], reg) makes all later loads/stores in the current thread wait until all previous loads and stores are completed and globally visible (i.e. the store buffer is flushed).
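To make that concrete, here is a minimal sketch of the classic StoreLoad litmus test (variable names are made up, assuming C++17 and an x86 target). Without the full barriers, the store buffer lets each thread's load execute before its own store becomes globally visible, so both loads can observe 0:

    #include <atomic>
    #include <thread>
    #include <cassert>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);
        // Full barrier: typically compiles to mfence or a locked operation on x86.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // With the fences, at least one thread must see the other's store;
        // without them, r1 == 0 && r2 == 0 is allowed on real x86 hardware.
        assert(r1 == 1 || r2 == 1);
    }

Note that each fence only stalls operations in the thread that executes it; neither core blocks the other.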

mfence on Skylake is implemented in a way that stalls the whole core until the store buffer drains. See my answer on Are loads and stores the only instructions that get reordered? for details; this extra slowdown was to fix an erratum. But locked operations and xchg aren't like that on Skylake; they're full memory barriers, but they still allow out-of-order execution of imul eax, edx, so we have proof that they don't stall the whole core.

With hyperthreading, I think this stalling happens per logical thread, not the whole core.

But note that the mfence manual entry doesn't say anything about stalling the core, so future x86 implementations are free to make it more efficient (like a lock or dword [rsp], 0), and only prevent later loads from reading L1d cache without blocking later non-load instructions.


sfence only does anything if there are any NT stores in flight. It doesn't order loads at all, so it doesn't have to stop later instructions from executing. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.

It just places a barrier in the store buffer that stops NT stores from reordering across it, and forces earlier NT stores to be globally visible before the sfence barrier can leave the store buffer (i.e. write-combining buffers have to flush). But the sfence itself can already have retired from the out-of-order execution part of the core (the ROB, or ReOrder Buffer) before it reaches the end of the store buffer.
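For example, the canonical NT-store pattern looks like this (a sketch only: the buffer size, names, and the relaxed flag store are my assumptions; requires SSE2 on an x86 target):

    #include <emmintrin.h>   // _mm_set1_epi32, _mm_stream_si128, _mm_sfence
    #include <atomic>
    #include <cstdint>

    alignas(16) int32_t buf[1024];
    std::atomic<bool> ready{false};

    void fill_and_publish() {
        __m128i v = _mm_set1_epi32(42);
        for (int i = 0; i < 1024; i += 4)
            _mm_stream_si128((__m128i*)&buf[i], v);  // NT stores: weakly ordered
        _mm_sfence();  // earlier NT stores become globally visible before any later store
        ready.store(true, std::memory_order_relaxed);  // safe to publish after the sfence
    }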

See also Does a memory barrier ensure that the cache coherence has been completed?


lfence as a memory barrier is nearly useless: it only prevents movntdqa loads from WC memory from reordering with later loads/stores. You almost never need that.

The actual use-cases for lfence mostly involve its Intel (but not AMD) behaviour that it doesn't allow later instructions to execute until it itself has retired. (so lfence; rdtsc on Intel CPUs lets you avoid having rdtsc read the clock too soon, as a cheaper alternative to cpuid; rdtsc)
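A sketch of that timing idiom with intrinsics (assuming GCC/Clang on x86; the wrapper name is made up):

    #include <x86intrin.h>   // __rdtsc, _mm_lfence
    #include <cstdint>

    uint64_t rdtsc_ordered() {
        // On Intel, lfence doesn't let later instructions start executing
        // until it retires, so rdtsc can't read the clock before earlier
        // instructions have finished executing.
        _mm_lfence();
        return __rdtsc();
    }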

Another important recent use-case for lfence is to block speculative execution (e.g. before a conditional or indirect branch), for Spectre mitigation. This is completely based on its Intel-guaranteed side effect of being partially serializing, and has nothing to do with its LoadLoad + LoadStore barrier effect.
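A sketch of that mitigation pattern (hypothetical function and names; real kernels use dedicated helpers for this, e.g. Linux's barrier_nospec()):

    #include <emmintrin.h>   // _mm_lfence
    #include <cstddef>
    #include <cstdint>

    uint8_t load_checked(const uint8_t* arr, size_t len, size_t i) {
        if (i < len) {
            _mm_lfence();   // block speculation past the bounds check, so the
                            // load below can't execute with an out-of-bounds i
            return arr[i];
        }
        return 0;
    }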

lfence does not have to wait for the store buffer to drain before it can retire from the ROB, so no combination of LFENCE + SFENCE is as strong as MFENCE. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?


Related: When should I use _mm_sfence, _mm_lfence, and _mm_mfence (when writing in C++ instead of asm).

Note that C++ intrinsics like _mm_sfence also block compile-time memory reordering. This is often necessary even when the asm instruction itself isn't, because C++ compile-time reordering happens based on C++'s very weak memory model, not the strong x86 memory model which applies to the compiler-generated asm.

So _mm_sfence may make your code work, but unless you're using NT stores it's overkill. A more efficient option would be std::atomic_thread_fence(std::memory_order_release) (which turns into zero instructions, just a compiler barrier.) See http://preshing.com/20120625/memory-ordering-at-compile-time/.
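A sketch of that cheaper fence-based pattern (names are made up; on x86 both fences compile to zero instructions but still block compile-time reordering):

    #include <atomic>

    int payload;                      // plain, non-atomic data
    std::atomic<bool> flag{false};

    void producer() {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // compiler barrier only on x86
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        if (flag.load(std::memory_order_relaxed)) {
            std::atomic_thread_fence(std::memory_order_acquire);  // also free on x86
            int v = payload;          // guaranteed to read 42
            (void)v;
        }
    }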
