Does the Intel Memory Model make SFENCE and LFENCE redundant?

Question

The Intel memory model guarantees:

  • Stores will not be reordered with other stores
  • Loads will not be reordered with other loads

http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

I have seen claims that SFENCE is redundant on x86-64 due to the Intel memory model, but never LFENCE. Do the above memory model rules make either instruction redundant?

Recommended answer

Right, LFENCE and SFENCE are not useful in normal code, because x86's acquire / release semantics for regular loads and stores make them redundant unless you're using other special instructions or memory types.

The only fence that matters for normal lockless code is the full barrier (including StoreLoad) from a locked instruction, or a slow MFENCE. Prefer xchg over mov+mfence for sequential-consistency stores, because it's faster. (See: Are loads and stores the only instructions that gets reordered?)

See also: Does `xchg` encompass `mfence` assuming no non-temporal instructions? (Yes, even with NT instructions, as long as there's no WC memory.)

Jeff Preshing's Memory Reordering Caught in the Act article is an easier-to-read description of the same case Bartosz's post talks about, where you need a StoreLoad barrier like MFENCE. Only MFENCE will do; you can't construct MFENCE out of SFENCE + LFENCE. (Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)
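
The StoreLoad case can be sketched in portable C++ as a store-buffer litmus test. This is only an illustration under stated assumptions: the function names `both_read_zero` and `count_reorderings` are made up for this example, and the code relies on the C++ guarantee that sequentially consistent atomics forbid the "both threads read 0" outcome. On x86, compilers typically get that guarantee by using xchg (or mov+mfence) for the seq_cst store.

```cpp
#include <atomic>
#include <thread>

// Store-buffer litmus test (hypothetical helper names).
// Each thread stores to one variable and then loads the other one.
// Returns true if both threads read 0, i.e. each load appeared to run
// before the other thread's store: the StoreLoad reordering x86 allows.
bool both_read_zero(std::memory_order store_order, std::memory_order load_order) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, store_order);
        r1 = y.load(load_order);
    });
    std::thread t2([&] {
        y.store(1, store_order);
        r2 = x.load(load_order);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Counts how many of `iterations` runs observed the reordering.
int count_reorderings(int iterations,
                      std::memory_order store_order,
                      std::memory_order load_order) {
    int seen = 0;
    for (int i = 0; i < iterations; ++i)
        seen += both_read_zero(store_order, load_order) ? 1 : 0;
    return seen;
}
```

With `std::memory_order_seq_cst` for both arguments the count must be zero. With release stores and acquire loads a non-zero count is allowed, although this simple harness starts new threads on every run and so may rarely catch it.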

If you had questions after reading the link you posted, read Jeff Preshing's other blog posts. They gave me a good understanding of the subject. :) Although I think I found the tidbit about SFENCE/LFENCE normally being a no-op in Doug Lea's page. Jeff's posts didn't consider NT loads/stores.

Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (My answer and @BeeOnRope's answer there are both good. I wrote this answer much longer ago than that one, so parts of this answer show my inexperience from years ago. My answer there considers the C++ intrinsics and C++ compile-time memory ordering, which is not at all the same thing as x86 asm runtime memory ordering. But you still don't want _mm_lfence().)

SFENCE is only relevant when using movnt (non-temporal) streaming stores, or when working with memory regions whose type is set to something other than the normal Write-Back. It is also relevant with clflushopt, which is kind of like a weakly-ordered store. NT stores bypass the cache as well as being weakly ordered. x86's normal memory model is strongly ordered, other than NT stores, WC (write-combining) memory, and ERMSB string ops (see below).

LFENCE is only useful for memory ordering with weakly-ordered loads, which are very rare. (Or possibly for LoadStore ordering with regular loads before NT stores?)

NT loads (movntdqa) from WB memory are still strongly ordered, even on a hypothetical future CPU that doesn't ignore the NT hint; the only way to do weakly-ordered loads on x86 is when reading from weakly-ordered memory (WC), and then I think only with movntdqa. This doesn't happen by accident in "normal" programs, so you only have to worry about this if you mmap video RAM or something.

(The primary use-case for lfence is not memory ordering at all; it's for serializing instruction execution, e.g. for Spectre mitigation, or with RDTSC. See Is LFENCE serializing on AMD processors? and the "linked questions" sidebar for that question.)

I got curious about this a couple weeks ago, and posted a fairly detailed answer to a recent question: Atomic operations, std::atomic<> and ordering of writes. I included lots of links to stuff about the memory model of C++ vs. hardware memory models.

If you're writing in C++, using std::atomic<> is an excellent way to tell the compiler what ordering requirements you have, so it doesn't reorder your memory operations at compile time. You can and should use weaker release or acquire semantics where appropriate, instead of the default sequential consistency, so the compiler doesn't have to emit any barrier instructions at all on x86. It just has to keep the ops in source order.
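
As a minimal sketch of that advice (the `Mailbox` type and the function names are hypothetical, not from the original answer), a writer can publish plain data with a release store and a reader can pick it up with an acquire load. On x86 both are expected to compile to ordinary mov instructions with no fence, while still stopping the compiler from reordering the accesses.

```cpp
#include <atomic>
#include <thread>

// Hypothetical example type: one plain payload guarded by an atomic flag.
struct Mailbox {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // publication flag
};

// Writer: fill in the data, then publish it with a release store.
void produce(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Reader: spin until the flag is set (acquire), then read the data.
// The acquire load synchronizes with the release store, so the payload
// read is guaranteed to see the value written before the flag.
int consume(Mailbox& box) {
    while (!box.ready.load(std::memory_order_acquire)) {
        // busy-wait; a real program might yield or back off here
    }
    return box.payload;
}

// Runs one writer and one reader thread and returns what the reader saw.
int run_once(int value) {
    Mailbox box;
    int seen = -1;
    std::thread reader([&] { seen = consume(box); });
    std::thread writer([&] { produce(box, value); });
    writer.join();
    reader.join();
    return seen;
}
```

Replacing the release / acquire orders with the default `std::memory_order_seq_cst` would still be correct, but on x86 the store would then need a full barrier.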

On a weakly ordered architecture like ARM or PPC, or x86 with movnt, you need a StoreStore barrier instruction between writing a buffer and setting a flag to indicate the data is ready. Also, the reader needs a LoadLoad barrier instruction between checking the flag and reading the buffer.

Not counting movnt, x86 already has LoadLoad barriers between every load, and StoreStore barriers between every store. (LoadStore ordering is also guaranteed). MFENCE is all 4 kinds of barriers, including StoreLoad, which is the only barrier x86 doesn't do by default. MFENCE makes sure loads don't use old prefetched values from before other threads saw your stores and potentially did stores of their own. (As well as being a barrier for NT store ordering and load ordering.)

Fun fact: x86 lock-prefixed instructions are also full memory barriers. They can be used as a substitute for MFENCE in old 32bit code that might run on CPUs not supporting it. lock add [esp], 0 is otherwise a no-op, and does the read/modify/write cycle on memory that's very likely hot in L1 cache and already in the M state of the MESI coherency protocol.

SFENCE is a StoreStore barrier. It's useful after NT stores to create release semantics for a following store.
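
A rough sketch of that pattern with intrinsics follows. The helper names `fill_and_publish`, `wait_and_sum` and `demo` are invented for illustration, and the `HAVE_X86_NT` guard is an assumption added so the example still builds on non-x86 targets, where it falls back to plain stores. Note that the C++ memory model itself says nothing about NT stores; the `_mm_sfence()` is what provides the hardware ordering here.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_X86_NT 1
#else
#define HAVE_X86_NT 0
#endif

// Fill `buf` with `value` using NT stores, then publish it via `ready`.
void fill_and_publish(int* buf, std::size_t n, int value, std::atomic<bool>& ready) {
#if HAVE_X86_NT
    for (std::size_t i = 0; i < n; ++i)
        _mm_stream_si32(buf + i, value);  // movnti: weakly-ordered NT store
    _mm_sfence();  // StoreStore barrier: order the NT stores before the flag store
#else
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = value;                   // fallback: ordinary stores
#endif
    ready.store(true, std::memory_order_release);
}

// Wait for the flag with an acquire load, then sum the buffer.
long long wait_and_sum(const int* buf, std::size_t n, const std::atomic<bool>& ready) {
    while (!ready.load(std::memory_order_acquire)) {
        // busy-wait
    }
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

// One writer, one reader; returns the sum the reader computed.
long long demo(std::size_t n, int value) {
    std::vector<int> buf(n, 0);
    std::atomic<bool> ready{false};
    long long sum = -1;
    std::thread reader([&] { sum = wait_and_sum(buf.data(), n, ready); });
    std::thread writer([&] { fill_and_publish(buf.data(), n, value, ready); });
    writer.join();
    reader.join();
    return sum;
}
```

The reader needs no LFENCE: its loads are ordinary loads from write-back memory, which x86 already keeps in order.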

LFENCE is almost always irrelevant as a memory barrier, because the only weakly-ordered loads on x86 are (I think) movntdqa loads from WC memory. It is a LoadLoad and also a LoadStore barrier. (loadNT / LFENCE / storeNT prevents the store from becoming globally visible before the load. I think this could happen in practice if the load address was the result of a long dependency chain, or the result of another load that missed in cache.)

Fun fact #2 (thanks @EOF): The stores from ERMSB (Enhanced rep movsb/rep stosb on IvyBridge and later) are weakly-ordered (but not cache-bypassing). ERMSB builds on regular Fast-String Ops (wide stores from the microcoded implementation of rep stos/movsb that's been around since PPro).

Intel documents the fact that ERMSB stores "may appear to execute out of order" in section 7.3.9.3 of their Software Developers Manual, vol1. They also say:

"Order-dependent code should write to a discrete semaphore variable after any string operations to allow correctly ordered data to be seen by all processors"

They don't mention any barrier instructions being necessary between the rep movsb and the store to a data_ready flag.

The way I read it, there's an implicit SFENCE after rep stosb / rep movsb (at least a fence for the string data, probably not other in-flight weakly ordered NT stores). Anyway, the wording implies that a write to the flag / semaphore becomes globally visible after all the string-move writes, so no SFENCE / LFENCE is needed in code that fills a buffer with a fast-string op and then writes a flag, or in code that reads it.

(LoadLoad ordering always happens, so you always see data in the order that other CPUs made it globally visible. i.e. using weakly-ordered stores to write a buffer doesn't change the fact that loads in other threads are still strongly ordered.)

Summary: use a normal store to write a flag indicating that a buffer is ready. Don't have readers just check the last byte of the block written with memset/memcpy.

I also think ERMSB stores prevent any later stores from passing them, so you still only need SFENCE if you're using movNT. i.e. the rep stosb as a whole has release semantics wrt. earlier instructions.

There's an MSR bit that can be cleared to disable ERMSB, for the benefit of new servers that need to run old binaries that write a "data ready" flag as part of a rep stosb or rep movsb or something. (In that case I guess you get the old fast-string microcode that may use an efficient cache protocol, but does make all the stores appear to other cores in order.)
