Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?


Question


    TL;DR: In a producer-consumer queue does it ever make sense to put an unnecessary (from C++ memory model viewpoint) memory fence, or unnecessarily strong memory order to have better latency at the expense of possibly worse throughput?


    The C++ memory model is executed on the hardware by having some sort of memory fences for stronger memory orders and not having them for weaker memory orders.

    In particular, if the producer does store(memory_order_release), and the consumer observes the stored value with load(memory_order_acquire), there are no fences between the load and the store. On x86 there are no fences at all; on ARM, fences are placed before the store and after the load.
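    For concreteness, here is a tiny example with the instruction sequences mainstream compilers typically emit for it (the asm in the comments describes common codegen and is illustrative, not authoritative for every compiler/target):

    #include <atomic>
    #include <cstddef>

    extern std::atomic<std::size_t> idx;

    std::size_t reader()        { return idx.load(std::memory_order_acquire); }
    void writer(std::size_t v)  { idx.store(v, std::memory_order_release); }

    // x86-64:  reader -> a plain mov load   (no fence instruction)
    //          writer -> a plain mov store  (no fence instruction)
    // AArch64: reader -> ldar  (acquire-load instruction)
    //          writer -> stlr  (release-store instruction)
    // 32-bit ARMv7 instead uses dmb barriers around plain loads/stores;
    // only seq_cst adds anything heavier on x86 (e.g. xchg or mov + mfence for the store).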

    The value stored without a fence will eventually be observed by a load without a fence (possibly after a few unsuccessful attempts).

    I'm wondering if putting a fence on either side of the queue can make the value be observed faster? What is the latency with and without a fence, if so?

    I expect that just having a loop with load(memory_order_acquire) and pause / yield limited to thousands of iterations is the best option, as it is used everywhere, but I want to understand why.

    Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (x64 flavor), and secondarily about ARM.


    Example:

    #include <atomic>
    #include <cstddef>
    #include <immintrin.h>  // _mm_pause

    T queue[MAX_SIZE];      // element type T and MAX_SIZE assumed defined elsewhere
    
    std::atomic<std::size_t>   shared_producer_index;
    
    void producer()
    {
       std::size_t private_producer_index = 0;
    
       for(;;)
       {
           private_producer_index++;  // Handling rollover and queue full omitted
    
           /* fill data */;
    
          shared_producer_index.store(
              private_producer_index, std::memory_order_release);
          // Maybe barrier here or stronger order above?
       }
    }
    
    
    void consumer()
    {
       std::size_t private_consumer_index = 0;
    
       for(;;)
       {
           std::size_t observed_producer_index = shared_producer_index.load(
              std::memory_order_acquire);
    
           while (private_consumer_index == observed_producer_index)
           {
               // Maybe barrier here or stronger order below?
              _mm_pause();
               observed_producer_index = shared_producer_index.load(
                  std::memory_order_acquire);
              // Switching from busy wait to kernel wait after some iterations omitted
           }
    
           /* consume as much data as index difference specifies */;
    
           private_consumer_index = observed_producer_index;
       }
    }
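    A minimal sketch of the omitted switch from busy-waiting to a kernel wait (the spin threshold and helper name are illustrative; C++20's atomic::wait/notify is one possible kernel-wait mechanism, and the producer would then need a matching shared_producer_index.notify_one() after its store):

    #include <atomic>
    #include <cstddef>
    #include <immintrin.h>   // _mm_pause

    std::size_t wait_for_new_index(std::atomic<std::size_t>& shared_producer_index,
                                   std::size_t private_consumer_index)
    {
        std::size_t observed = shared_producer_index.load(std::memory_order_acquire);
        for (int spins = 0; observed == private_consumer_index; ++spins)
        {
            if (spins < 4000)      // illustrative threshold: spin in user space first
                _mm_pause();
            else                   // then let the kernel block this thread (C++20)
                shared_producer_index.wait(observed, std::memory_order_acquire);
            observed = shared_producer_index.load(std::memory_order_acquire);
        }
        return observed;
    }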
    

    Solution

    Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.

    It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. For a full barrier, that means blocking later loads and stores until the store buffer is drained. See: Size of store buffers on Intel hardware? What exactly is a store buffer?
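    To make that concrete, here is a minimal sketch of the two producer variants the question is weighing (the function names are mine). The fence in the second version does not make the store reach cache any sooner; it only makes this core stall its own later loads/stores until the store buffer has drained:

    #include <atomic>
    #include <cstddef>

    extern std::atomic<std::size_t> shared_producer_index;

    void publish_plain(std::size_t i)
    {
        // The store sits in this core's store buffer and commits to L1d as fast as
        // the hardware can manage, with or without any fence.
        shared_producer_index.store(i, std::memory_order_release);
    }

    void publish_with_fence(std::size_t i)
    {
        shared_producer_index.store(i, std::memory_order_release);
        // Full barrier: *this* core waits for its own store buffer to drain before
        // executing later loads/stores.  Other cores do not see the store any earlier.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }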

    In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.
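    As an illustration of that "bad old days" approach (GCC/Clang inline-asm syntax), contrasted with the std::atomic equivalent; this is shown only for context, the legacy version has no data-race guarantees:

    #include <atomic>

    // Legacy: a compiler barrier stops the compiler from caching flag_legacy in a
    // register across loop iterations.  It emits no fence instruction at all.
    extern int flag_legacy;

    inline void compiler_barrier()
    {
        asm volatile("" ::: "memory");   // "memory" clobber: forces reloads/spills
    }

    void wait_legacy()
    {
        while (flag_legacy == 0)
            compiler_barrier();          // compiler must re-read flag_legacy each time
    }

    // Modern: std::atomic expresses the same intent portably and correctly.
    extern std::atomic<int> flag;

    void wait_modern()
    {
        while (flag.load(std::memory_order_acquire) == 0) {
            // spin (or _mm_pause(); see the question's example)
        }
    }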


    The question "If I don't use fences, how long could it take a core to see another core's writes?" is highly related; I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)


    There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)

    Getting the current core to write-back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that. Creating a conflict miss in L1d and L2 maybe, if producer performance is unimportant other than creating low latency for the next read.

    On x86, Intel Tremont (low power Silvermont series) will introduce cldemote (_mm_cldemote) that writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but it does force the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.) Related Q&As:

    • Is there any way to write for Intel CPU direct core to core communication code?
    • How to force cpu core to flush store buffer in c?
    • x86 MESI invalidate cache line latency issue
    • Force a migration of a cache line to another core (not possible)
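    If one wanted to experiment with the cldemote idea above, a producer-side sketch might look like the following. The helper name and the choice to demote after every store are illustrative assumptions; _mm_cldemote needs compiler support (e.g. -mcldemote) plus a CPU that implements the hint, and whether it helps at all is exactly the kind of thing that needs profiling:

    #include <atomic>
    #include <cstddef>
    #include <immintrin.h>   // _mm_cldemote

    extern std::atomic<std::size_t> shared_producer_index;

    void publish_and_demote(std::size_t i)   // hypothetical helper, not from the question
    {
        shared_producer_index.store(i, std::memory_order_release);
        // Hint the CPU to push the just-written line out toward a shared outer cache
        // (e.g. L3), so a consumer on another core may hit there instead of having to
        // snoop this core's L1d.  cldemote is only a hint; the CPU is free to ignore it.
        _mm_cldemote(&shared_producer_index);
    }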

    Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).


    For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.

    That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.

    A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)

    I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.

    Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.
