Cache-friendly way to collect results from multiple threads


Problem description

Consider N threads doing some asynchronous tasks with small result values like double or int64_t. So about 8 result values can fit in a single CPU cache line. N is equal to the number of CPU cores.

On one hand, if I just allocate an array of N items, each a double or int64_t, then 8 threads will share a CPU cache line, which seems inefficient.

On the other hand, if I allocate a whole cache line for each double/int64_t, the receiver thread will have to fetch N cache lines, each written by a different CPU core (except 1).

So is there an efficient solution for this scenario? The CPU is x86-64. A solution in C++ is preferred.

Clarification 1: thread launch/exit overhead is not significant because a thread pool is used. So it's mostly synchronization on a critical section.

Clarification 2: The parallel batches carry a dependency. The master thread can only launch the next batch of parallel computations after it has collected and processed the results of the previous batch, because results of the previous batch serve as parameters of the next batch.

Answer

update: I may have misunderstood. Are you looking for fast turnarounds on many tiny batches of work? In that case, you're probably better off with each thread writing to its own cache line, or maybe group them in pairs. If each worker thread has to get exclusive access (MESI/MESIF/MOESI) to write into the same cache line, that will serialize all the cores into some order.
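
The per-thread-cache-line layout can be sketched with `alignas`. This is a minimal illustration, not code from the answer; the names `PaddedResult` and `kNumWorkers` are assumptions, and 64 bytes is the usual x86-64 line size.

```cpp
#include <cstdint>

// One result slot per worker, padded so each slot owns a full 64-byte
// cache line. Workers never contend for write ownership of each
// other's lines (no false sharing on the producer side).
struct alignas(64) PaddedResult {
    int64_t value;
    // The remaining 56 bytes of the line are implicit padding.
};

static_assert(sizeof(PaddedResult) == 64, "each slot must own one line");

constexpr int kNumWorkers = 8;        // assumption: one worker per core
PaddedResult g_results[kNumWorkers];  // one line per worker thread
```

The trade-off the answer describes: with this layout the collector must touch `kNumWorkers` distinct lines, but those misses can all be in flight at once.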

Having the reader thread read the results from N threads lets all those cache misses happen in parallel.

From your comments:

I would like to scatter and gather millions of such parallel calculations per second. In other words, the head thread distributes the work, launches worker threads, then collects the results, does something with them, and then launches parallel computations again.

So you have millions of results to collect, but only one worker thread per core. So each worker thread has to produce ~100k results.

Give each worker an output array, where it stores consecutive results from different tasks it has finished. The actual arrays might only be 4k entries long or something, with some synchronization to let the writer wrap around and reuse the first half once the reader has started on the second half of that thread's buffer.
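
The wrap-around buffer with reader/writer synchronization amounts to a single-producer/single-consumer ring, one per worker. A sketch under stated assumptions — the type name `ResultRing`, the capacity, and the exact index scheme are all illustrative, not from the answer:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Single-producer / single-consumer ring buffer, one per worker thread.
// The worker wraps around and reuses the front of the array once the
// collector has moved past it.
template <size_t N>  // assumption: N is a power of two
struct ResultRing {
    int64_t buf[N];
    std::atomic<size_t> head{0};  // next write index; worker writes only
    std::atomic<size_t> tail{0};  // next read index; collector writes only

    bool push(int64_t v) {        // worker side
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N)
            return false;         // full: collector hasn't caught up yet
        buf[h % N] = v;
        head.store(h + 1, std::memory_order_release);  // publish
        return true;
    }

    bool pop(int64_t& v) {        // collector side
        size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return false;         // empty: collector got ahead
        v = buf[t % N];
        tail.store(t + 1, std::memory_order_release);  // free the slot
        return true;
    }
};
```

Because only one thread writes `head` and only one writes `tail`, plain release stores are enough; no locked read-modify-write instructions are needed on the fast path.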

When the collector thread reads a result from one of those arrays, it will bring that cache line into its own L2/L1D caches, bringing with it the 7 other results in that same cache line (assuming the usual case where the worker thread has already filled all 8 int64_t slots and won't write that cache line again for this group of tiny tasks).

Or better, collect them in batches aligned to cache lines, so conflict misses don't evict a cache line from the collector's L1D before it gets back to it. (Reduce the chance of this happening by skewing the result arrays with a different offset for each thread, so the collector thread isn't reading N cache lines that are all offset from each other by a multiple of 4kiB or something.)
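
The skewing idea can be sketched as follows; the region size, the one-line skew step, and the helper name `thread_base` are illustrative assumptions, not part of the original answer:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Give each worker's result region a different starting offset so the
// collector's N reads don't all alias to the same L1D cache set, which
// is what happens when the regions sit exactly a page-multiple apart.
constexpr size_t kEntriesPerThread = 4096;  // assumption
constexpr size_t kLine = 64;                // cache line size, bytes
constexpr size_t kLinesOfSkew = 1;          // skew step per thread

int64_t* thread_base(std::vector<int64_t>& backing, int tid) {
    // Thread tid's region starts tid extra cache lines later, so base
    // addresses differ by (region + tid*64 bytes) rather than a clean
    // power-of-two stride.
    size_t skew = size_t(tid) * kLinesOfSkew * kLine / sizeof(int64_t);
    return backing.data() + size_t(tid) * kEntriesPerThread + skew;
}
```

Allocate the backing vector with room for the cumulative skew (for T threads, `T * kEntriesPerThread + T * kLine / sizeof(int64_t)` entries is enough under these assumptions).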

If you can use a sentinel value in your output arrays, that's probably ideal. If the collector sees that, it knows it got ahead of the worker and should check other threads. (Or sleep if it got through all output arrays without finding new results).
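
A minimal sketch of the sentinel scheme, assuming the computation can never legitimately produce `INT64_MIN` (the sentinel choice and the function name `try_take` are assumptions for illustration):

```cpp
#include <cstdint>
#include <atomic>

// Slots are pre-filled with a value the computation can never produce.
constexpr int64_t kSentinel = INT64_MIN;  // assumption: never a real result

// Collector side: returns true and takes the result if the worker has
// published one; returns false if the collector got ahead of this worker.
bool try_take(std::atomic<int64_t>& slot, int64_t& out) {
    int64_t v = slot.load(std::memory_order_acquire);
    if (v == kSentinel)
        return false;                  // not produced yet; check others
    out = v;
    slot.store(kSentinel, std::memory_order_relaxed);  // re-arm the slot
    return true;
}
```

The worker publishes by simply storing its result into the slot with a release store; no separate "ready" variable is needed because the sentinel doubles as the not-ready flag.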

Otherwise you also need current-output-position shared variables which the workers update (with a release-store) after writing the output array. (Maybe batch these position-counter updates to one per 8 array results. But make sure you do it with a pure atomic store, not a += 8. Since the producer thread is the only one that writes that variable, it would be silly to have the overhead of a lock add.)
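
This is what the batched counter update might look like; `WorkerOutput`, the buffer size, and the batch-of-8 choice are illustrative assumptions matching the answer's description:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Per-worker output cursor. Only the owning worker ever writes
// `published`, so a plain release store suffices; `published += 8`
// would be an atomic read-modify-write (a locked add on x86) for no
// benefit.
struct WorkerOutput {
    int64_t results[4096];             // assumption: fixed-size buffer
    std::atomic<size_t> published{0};  // results[0..published) are valid

    size_t local_count = 0;            // worker-private, never shared

    void write(int64_t v) {            // worker side
        results[local_count++] = v;
        if (local_count % 8 == 0) {
            // Batch the counter update to one per cache line of results,
            // published with a pure release store, not a locked RMW.
            published.store(local_count, std::memory_order_release);
        }
    }
};
```

The collector does an acquire load of `published` and reads everything below it; results written since the last multiple of 8 simply become visible on the next batch boundary.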

This would easily cause false sharing between worker threads if packed into one array, and it definitely needs to be in cacheable memory (not UC or WC), so a worker thread can rewrite it in-place efficiently. So you definitely want each thread to have its own cache line for these. The collector will just have to suffer the penalty of reading N different cache lines (and probably memory mis-speculation machine clears; see "What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?").

Actually, the best option in that case would probably be to use one of the 8 qwords in every cache line of the output arrays as a "complete" flag or bitmap, so the collector thread can check that to see if the 7 results in a cache line are ready.
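
A sketch of that layout, assuming a simple 0/1 flag rather than a bitmap (the struct name `ResultLine` and the publish/read helpers are illustrative assumptions):

```cpp
#include <atomic>
#include <cstdint>

// One cache line of the output array: 7 result qwords plus a "ready"
// flag in the 8th slot, so the collector checks one word to learn
// whether the whole line is complete.
struct alignas(64) ResultLine {
    int64_t results[7];
    std::atomic<int64_t> ready{0};  // 0 = in progress, 1 = all 7 written

    void publish(const int64_t (&vals)[7]) {    // worker side
        for (int i = 0; i < 7; ++i) results[i] = vals[i];
        ready.store(1, std::memory_order_release);  // after the results
    }

    bool try_read(int64_t (&out)[7]) {          // collector side
        if (ready.load(std::memory_order_acquire) == 0)
            return false;                        // line not complete yet
        for (int i = 0; i < 7; ++i) out[i] = results[i];
        return true;
    }
};

static_assert(sizeof(ResultLine) == 64, "7 results + flag = one line");
```

The release store on `ready` orders it after the seven result stores, so an acquire load of `ready == 1` guarantees the collector sees all seven values.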

If just getting the results between worker and collector threads is your main bottleneck, then your threading is probably too fine-grained. You should break your tasks up more coarsely, or have each worker thread do some of the combining on the results it produces while they're still hot in its L1D. That's much better bandwidth than passing them to another core through L3 or DRAM.

