boost lockfree spsc_queue cache memory access


Problem description

I need to be extremely concerned with speed/latency in my current multi-threaded project.

Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.

I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.

If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 caches. But to access the element, does it not need to find and load the pointed-to object, either from the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?

Thank you.

Recommended answer

C++ is principally a pay-for-what-you-need ecosystem.

Any regular queue will let you choose the storage semantics (by value or by reference).

However, this time you ordered something special: you ordered a lock free queue. In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.

You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).

Good question.

Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures. But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?

No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the indices of the buffer's front and end positions atomically. The container could, at leisure, update an internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering.) Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].

As long as the container can guard the invariant that the front never fully wraps around and reaches the back, this is a sweet deal. I think this idea was popularized in the Disruptor library (of LMAX fame). You get mechanical resonance from


  • linear memory access patterns while reading/writing
  • even better if you can make the record size aligned with (a multiple of) physical cache lines
  • all the data is local unless the POD contains raw references outside that record

  1. Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue with a statically supplied maximum capacity):

typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;

storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}

(The element type T need not even be POD, but it needs to be both default-constructible and copyable).

Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):

static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;


  • In fact, as you can see, there is only the "internal" index on either the read or the write side. This is possible because there is only one writing thread and only one reading thread, which means there could only be more space at the end of a write operation than anticipated.

    Several other optimizations are present:


    • branch prediction hints (unlikely()) on platforms that support it

    • the ability to push/pop a range of elements at once; this should improve throughput when you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cache line

    • use of std::uninitialized_copy where possible

    • trivial constructor/destructor invocations are optimized out at instantiation time

    • the uninitialized_copy will be optimized into memcpy on all major standard-library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)

    All in all, we see a best-in-class possible idea for a ringbuffer

    Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.

    On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).

    So in practice you might consider storing frequently used data right there in the value, and referencing the less frequently used data via a pointer (the cost of the pointer is low unless it's traversed).

    If you need that "attachment" model, consider using a custom allocator for the referred-to data, so you can achieve favorable memory access patterns there too.

    [1] I suppose for SPSC, acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else follow my example :)
