If I don't use fences, how long could it take a core to see another core's writes?


Question

I have been trying to Google my question but I honestly don't know how to succinctly state the question.

Suppose I have two threads in a multi-core Intel system. These threads are running on the same NUMA node. Suppose thread 1 writes to X once, then only reads it occasionally moving forward. Suppose further that, among other things, thread 2 reads X continuously. If I don't use a memory fence, how long could it be between thread 1 writing X and thread 2 seeing the updated value?
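
To pin the scenario down, here is a minimal C++ sketch of it (the variable name, the stored value, and the use of std::memory_order_relaxed are my assumptions; the question just says "no fences"):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0};  // the shared variable from the question

int main() {
    // Thread 1: writes X once, then mostly leaves it alone.
    std::thread t1([] {
        X.store(42, std::memory_order_relaxed);  // plain store, no fence
    });

    // Thread 2: reads X continuously until it sees the update.
    std::thread t2([] {
        while (X.load(std::memory_order_relaxed) != 42) {
            // spin: how long can this loop run after t1's store retires?
        }
    });

    t1.join();
    t2.join();
}
```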

I understand that the write of X will go to the store buffer and from there to the cache, at which point MESIF will kick in and thread 2 will see the updated value via QPI. (Or at least this is what I've gleaned). I presume that the store buffer would get written to the cache either on a store fence or if that store buffer entry needs to be reused, but I don't know how store buffers get allocated to writes.

Ultimately the question I'm trying to answer for myself is if it is possible for thread 2 to not see thread 1's write for several seconds in a fairly complicated application that is doing other work.

Solution

Memory barriers don't make other threads see your stores any faster. (Except that blocking later loads could slightly reduce contention for committing buffered stores.)

The store buffer always tries to commit retired (known non-speculative) stores to L1d cache as fast as possible. Cache is coherent (see note 1), so that makes them globally visible because of MESI/MESIF/MOESI. The store buffer is not designed as a proper cache or write-combining buffer (although it can combine back-to-back stores to the same cache line), so it needs to empty itself to make room for new stores. Unlike a cache, it wants to keep itself empty, not full.

Note 1: not just x86; all multi-core systems of any ISA where we can run a single instance of Linux across its cores are necessarily cache coherent; Linux relies on volatile for its hand-rolled atomics to make data visible. Similarly, C++ std::atomic load/store operations with mo_relaxed are just plain asm loads and stores on all normal CPUs, relying on hardware for visibility between cores rather than manual flushing; When to use volatile with multi threading? explains this. There are some clusters, or hybrid microcontroller+DSP ARM boards, with non-coherent shared memory, but we don't run threads of the same process across separate coherency domains. Instead, you run a separate OS instance on each cluster node. I'm not aware of any C++ implementation where atomic<T> loads/stores include manual flush instructions. (Please let me know if there are any.)
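
As a concrete illustration of that point, relaxed atomic accesses compile to the same plain loads and stores as non-atomic code on x86-64; the asm in the comments is typical GCC/Clang output, quoted from memory rather than from the answer:

```cpp
#include <atomic>

std::atomic<int> x;

void writer() {
    x.store(1, std::memory_order_relaxed);     // mov DWORD PTR x[rip], 1
}

int reader() {
    return x.load(std::memory_order_relaxed);  // mov eax, DWORD PTR x[rip]
}
// No fence instructions anywhere: visibility between cores comes from
// cache coherency alone, exactly as the note above describes.
```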


Fences/barriers work by making the current thread wait

... until whatever visibility is required has happened via the normal mechanisms.

A simple implementation of a full barrier (mfence or a locked operation) is to stall the pipeline until the store buffer drains, but high-performance implementations can do better and allow out-of-order execution separately from the memory-order restriction.

(Unfortunately Skylake's mfence does fully block out-of-order execution, to fix the obscure SKL079 erratum involving NT loads from WC memory. But lock add or xchg or whatever only block later loads from reading L1d or the store buffer until the barrier reaches the end of the store buffer. And mfence on earlier CPUs presumably also doesn't have that problem.)
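
To show where those barrier instructions actually appear, here is a small sketch; the codegen noted in the comments is the usual GCC/Clang choice on x86-64, not something the answer spells out:

```cpp
#include <atomic>

std::atomic<int> flag;

void publish() {
    // Typical x86-64 codegen: xchg (older compilers: mov + mfence).
    // The barrier makes THIS thread wait for the store buffer before
    // later loads can run; it doesn't push the store out any sooner.
    flag.store(1, std::memory_order_seq_cst);
}

void full_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence on x86-64
}
```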


In general on non-x86 architectures (which have explicit asm instructions for weaker memory barriers, like only StoreStore fences without caring about loads), the principle is the same: block whichever operations it needs to block until this core has completed earlier operations of whatever type.
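
As a sketch of how that looks on a weakly-ordered ISA, a standard release/acquire handoff compiles to those weaker barriers; the AArch64 mappings in the comments are the usual compiler choices, added by me:

```cpp
#include <atomic>

int payload;                      // plain data, published via 'ready'
std::atomic<bool> ready{false};

void producer() {
    payload = 123;
    // AArch64: stlr (store-release); a bare StoreStore-only barrier
    // would be 'dmb ishst'. Either way the barrier only orders this
    // core's own stores; the consumer still sees them via ordinary
    // cache coherency, no sooner than without the barrier.
    ready.store(true, std::memory_order_release);
}

bool consumer(int& out) {
    if (ready.load(std::memory_order_acquire)) {  // AArch64: ldar
        out = payload;  // guaranteed to observe 123
        return true;
    }
    return false;
}
```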

> Ultimately the question I'm trying to answer for myself is if it is possible for thread 2 to not see thread 1's write for several seconds

No, the worst-case latency is maybe something like store-buffer length (56 entries on Skylake, up from 42 in BDW) times cache-miss latency, because x86's strong memory model (no StoreStore reordering) requires stores to commit in-order. But RFOs for multiple cache lines can be in flight at once, so the max delay is maybe 1/5th of that (conservative estimate: there are 10 Line Fill Buffers). There can also be contention from loads also in flight (or from other cores), but we just want an order of magnitude back-of-the-envelope number.

Let's say RFO latency (DRAM or from another core) is 300 clock cycles (basically made up) on a 3GHz CPU. So a worst-case delay for a store to become globally visible is maybe something like 300 * 56 / 5 = 3360 core clock cycles. So within an order of magnitude, worst case is about 1 microsecond on the 3GHz CPU we're assuming. (CPU frequency cancels out, so an estimate of RFO latency in nanoseconds would have been more useful.)
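
Spelled out as code, the back-of-the-envelope arithmetic (all inputs are the made-up figures above):

```cpp
#include <cstdio>

int main() {
    // Made-up inputs from the estimate above.
    const double rfo_latency_cycles  = 300;  // DRAM or cross-core RFO
    const double store_buffer_entries = 56;  // Skylake
    const double parallel_rfos        = 5;   // conservative; 10 LFBs exist
    const double ghz                  = 3.0;

    double worst_cycles = rfo_latency_cycles * store_buffer_entries / parallel_rfos;
    std::printf("worst case: %.0f cycles = %.2f microseconds at %.0f GHz\n",
                worst_cycles, worst_cycles / (ghz * 1e3), ghz);
    // prints: worst case: 3360 cycles = 1.12 microseconds at 3 GHz
}
```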

That's assuming all your stores need to wait a long time for RFOs because they're all to locations that are uncached or owned by other cores, and that none of them are back-to-back stores to the same cache line that could merge in the store buffer. So normally you'd expect it to be significantly faster.

I don't think there's any plausible mechanism for it to take even a hundred microseconds, let alone a whole second.

If all your stores are to cache lines where other cores are all contending for access to the same line, your RFOs could take longer than normal, so maybe tens of microseconds, maybe even a hundred. But that kind of absolute worst case wouldn't happen by accident.
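
If you'd rather measure than estimate, a crude ping-pong microbenchmark gives a real number for your machine; this sketch is my own construction, measures round trips (so one-way visibility is roughly half), and ignores details like pinning threads to cores:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> ball{0};

int main() {
    const int iters = 1000000;

    std::thread peer([&] {
        for (int i = 1; i < 2 * iters; i += 2) {
            while (ball.load(std::memory_order_relaxed) != i) {}  // wait for odd
            ball.store(i + 1, std::memory_order_relaxed);         // reply even
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 2 * iters; i += 2) {
        ball.store(i + 1, std::memory_order_relaxed);             // send odd
        while (ball.load(std::memory_order_relaxed) != i + 2) {}  // wait even
    }
    auto t1 = std::chrono::steady_clock::now();
    peer.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("round trip: %.1f ns, one-way visibility ~%.1f ns\n",
                ns / iters, ns / iters / 2);
}
```

On typical desktop hardware you'd expect tens of nanoseconds one-way between cores sharing an L3, which lines up with "far below a microsecond in the common case".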

