Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?


Question


TL;DR: In a producer-consumer queue does it ever make sense to put an unnecessary (from C++ memory model viewpoint) memory fence, or unnecessarily strong memory order to have better latency at the expense of possibly worse throughput?


The C++ memory model is implemented on hardware by emitting some sort of memory fence for the stronger memory orders and omitting them for the weaker ones.

In particular, if the producer does store(memory_order_release) and the consumer observes the stored value with load(memory_order_acquire), there are no fences between the load and the store. On x86 there are no fences at all; on ARM, fences are placed before the store and after the load.
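For illustration, here is a minimal release/acquire pair like the one above; the comments describe typical compiler output (e.g. GCC/Clang), which is an observation about common codegen rather than a guarantee:

#include <atomic>

std::atomic<int> flag{0};

void producer_side()
{
    // x86-64: a plain mov store; the hardware ordering is already strong enough
    //         for release semantics, so no fence instruction is emitted.
    // AArch64: typically a single stlr (store-release) instruction.
    // ARMv7:   typically dmb ish followed by a plain str.
    flag.store(1, std::memory_order_release);
}

int consumer_side()
{
    // x86-64: a plain mov load.
    // AArch64: typically ldar (load-acquire).
    // ARMv7:   a plain ldr followed by dmb ish.
    return flag.load(std::memory_order_acquire);
}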

The value stored without a fence will eventually be observed by a load without a fence (possibly after a few unsuccessful attempts).

I'm wondering if putting a fence on either side of the queue can make the value be observed faster? If so, what is the latency with and without the fence?

I expect that just having a loop with load(memory_order_acquire) and pause / yield, limited to thousands of iterations, is the best option, as it is used everywhere, but I want to understand why.

Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (x64 flavor), and secondarily about ARM.


Example:

// Headers needed (assumed): <atomic> for std::atomic, <cstddef> for std::size_t,
// <immintrin.h> for _mm_pause on x86.
#include <atomic>
#include <cstddef>
#include <immintrin.h>

T queue[MAX_SIZE];  // element type T and capacity MAX_SIZE defined elsewhere

std::atomic<std::size_t>   shared_producer_index;

void producer()
{
   std::size_t private_producer_index = 0;

   for(;;)
   {
       private_producer_index++;  // Handling rollover and queue full omitted

       /* fill data */;

      shared_producer_index.store(
          private_producer_index, std::memory_order_release);
      // Maybe barrier here or stronger order above?
   }
}


void consumer()
{
   std::size_t private_consumer_index = 0;

   for(;;)
   {
       std::size_t observed_producer_index = shared_producer_index.load(
          std::memory_order_acquire);

       while (private_consumer_index == observed_producer_index)
       {
           // Maybe barrier here or stronger order below?
          _mm_pause();
          observed_producer_index = shared_producer_index.load(
             std::memory_order_acquire);
          // Switching from busy wait to kernel wait after some iterations omitted
       }

       /* consume as much data as index difference specifies */;

       private_consumer_index = observed_producer_index;
   }
}
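For reference, a minimal sketch of the omitted "switching from busy wait to kernel wait" step might look like the following. The helper name and the spin limit are placeholders, and C++20 std::atomic::wait is assumed for the blocking fallback (which would also require the producer to call shared_producer_index.notify_one() after its store):

#include <atomic>
#include <cstddef>
#include <immintrin.h>

std::size_t wait_for_new_index(std::atomic<std::size_t>& shared_index,
                               std::size_t last_seen)
{
    std::size_t observed = shared_index.load(std::memory_order_acquire);

    // Busy-wait phase: spin with a pause hint for a bounded number of iterations.
    for (int spins = 0; observed == last_seen && spins < 4096; ++spins)
    {
        _mm_pause();
        observed = shared_index.load(std::memory_order_acquire);
    }

    // Kernel-wait phase: block until the producer publishes a new index.
    while (observed == last_seen)
    {
        shared_index.wait(last_seen, std::memory_order_acquire);
        observed = shared_index.load(std::memory_order_acquire);
    }
    return observed;
}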

Solution

Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.

It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. A full barrier blocks later loads and stores until the store buffer is drained. (See "Size of store buffers on Intel hardware? What exactly is a store buffer?")

In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue, not an asm one. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.
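As a concrete illustration of that old pattern (a sketch assuming GCC/Clang inline-asm syntax, not something to use in new code):

// Pre-std::atomic style: a plain shared int plus a compiler-only barrier.
// The barrier forces the compiler to reload the flag from memory on every
// iteration instead of caching it in a register, but it emits no instruction
// at all; cache coherence is handled by the hardware regardless.
// (Under the C++11 memory model this is formally a data race.)
extern int shared_flag_legacy;   // hypothetical plain, non-atomic shared flag

void old_style_spin()
{
    while (shared_flag_legacy == 0)
    {
        asm volatile("" ::: "memory");   // compiler barrier, zero asm output
    }
}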


The question "If I don't use fences, how long could it take a core to see another core's writes?" is highly related; I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)


There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)

Getting the current core to write-back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that. Creating a conflict miss in L1d and L2 maybe, if producer performance is unimportant other than creating low latency for the next read.

On x86, Intel Tremont (low power Silvermont series) will introduce cldemote (_mm_cldemote) that writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but does force the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
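A sketch of what using cldemote on the producer side might look like, assuming a compiler and CPU that support the intrinsic (the helper name is made up, and whether it helps at all would need profiling):

#include <atomic>
#include <cstddef>
#include <immintrin.h>

// Hypothetical producer epilogue: after publishing the index, hint that the
// cache line should be pushed out of this core's private caches toward a
// shared outer level, where another core's load can hit it more cheaply.
void publish_and_demote(std::atomic<std::size_t>& shared_index, std::size_t value)
{
    shared_index.store(value, std::memory_order_release);
    _mm_cldemote(&shared_index);   // write the line back as far as an outer cache
    // _mm_clwb(&shared_index);    // alternative: forces write-back all the way to
                                   // DRAM, and on Skylake behaves like clflushopt
}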

Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects; see "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?". On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).
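That linked question is the IRIW (Independent Reads of Independent Writes) litmus test; a minimal sketch of it (variable and function names are mine):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1a, r1b, r2a, r2b;

// Two independent writers, one per thread:
void writer_x() { x.store(1, std::memory_order_release); }
void writer_y() { y.store(1, std::memory_order_release); }

// Two readers, one per thread, reading the flags in opposite order:
void reader1() { r1a = x.load(std::memory_order_acquire); r1b = y.load(std::memory_order_acquire); }
void reader2() { r2a = y.load(std::memory_order_acquire); r2b = x.load(std::memory_order_acquire); }

// The outcome r1a==1, r1b==0, r2a==1, r2b==0 means the two readers disagreed on
// which store happened first. With acquire loads this is allowed, and PowerPC can
// really produce it via store-forwarding between SMT siblings; with seq_cst on all
// the loads and stores it is forbidden, and on x86 / ARMv8 it never happens anyway
// because stores become visible to all other cores at the same time.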


For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.

That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.

A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)

I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.

Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

