How to set bits of a bit vector efficiently in parallel?


Problem Description

Consider a bit vector of N bits (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The latter array is not sorted. The bit vector is just an array of integers, specifically __m256i, where 256 bits are packed into each __m256i structure.

How can this work be split efficiently across multiple threads?

Preferred language is C++ (MSVC++2017 toolset v141); assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if there's any benefit from it.

Solution

Let's assume you want to divide this work up among T threads. It's a pretty interesting problem since it isn't trivially parallelizable via partitioning and various solutions may apply for different sizes of N and M.

Fully Concurrent Baseline

You could simply divide up the array M into T partitions and have each thread work on its own partition of M with a shared N. The main problem is that since M is not sorted, all threads may access any element of N and hence stomp on each other's work. To avoid this, you'd have to use atomic operations such as std::atomic::fetch_or for each modification of the shared N array, or else come up with some locking scheme. Both approaches are likely to kill performance (i.e., using an atomic operation to set a bit is likely to be an order of magnitude slower than the equivalent single-threaded code).
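For concreteness, here is a minimal sketch of this baseline, with the bit vector held as 64-bit words rather than __m256i for simplicity; the function name set_bits_atomic and the slicing scheme are illustrative, not from the question:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Fully concurrent baseline: each thread takes a slice of M and sets bits in
// the shared vector with an atomic RMW, so concurrent writers can't lose
// updates -- but every single bit set pays for an atomic operation.
void set_bits_atomic(std::vector<std::atomic<uint64_t>>& bits,
                     const std::vector<uint32_t>& m,
                     unsigned num_threads) {
    std::vector<std::thread> workers;
    const size_t chunk = (m.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const size_t begin = std::min(m.size(), size_t(t) * chunk);
        const size_t end   = std::min(m.size(), begin + chunk);
        workers.emplace_back([&bits, &m, begin, end] {
            for (size_t i = begin; i < end; ++i)
                bits[m[i] / 64].fetch_or(uint64_t(1) << (m[i] % 64),
                                         std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
}
```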

Let's look at ideas that are likely faster.

Private N

One relatively obvious idea to avoid the "shared N" problem, which requires atomic operations for all mutations of N, is simply to give each of the T threads a private copy of N and merge them at the end via or.

Unfortunately, this solution is O(N) + O(M/T), whereas the original single-threaded solution is O(M) and the "atomic" solution above is something like O(M/T)[4]. Since we know that N >> M, this is likely to be a poor tradeoff in this case. Still, it's worth noting that the hidden constants in each term are very different: the O(N) term, which comes from the merging step[0], can use 256-bit wide vpor instructions, meaning a throughput of something close to 200-500 bits/cycle (if cached), while the bit-setting step, which is O(M/T), I estimate at closer to 1 bit/cycle. So this approach can certainly be the best one for moderate T even if the size of N is 10 or 100 times the size of M.
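A minimal sketch of the private-N approach, under the same simplifying assumptions as above (64-bit words, illustrative names); the final merge loop is the O(N) term, and compilers will typically vectorize it into vpor under AVX2:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Private-N sketch: each thread fills its own zeroed copy of the bit vector
// with plain, unsynchronized stores, then the copies are OR-merged into the
// shared result in one straight pass over memory.
void set_bits_private(std::vector<uint64_t>& bits,
                      const std::vector<uint32_t>& m,
                      unsigned num_threads) {
    std::vector<std::vector<uint64_t>> priv(
        num_threads, std::vector<uint64_t>(bits.size(), 0));
    std::vector<std::thread> workers;
    const size_t chunk = (m.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const size_t begin = std::min(m.size(), size_t(t) * chunk);
        const size_t end   = std::min(m.size(), begin + chunk);
        workers.emplace_back([&priv, &m, t, begin, end] {
            for (size_t i = begin; i < end; ++i)
                priv[t][m[i] / 64] |= uint64_t(1) << (m[i] % 64);
        });
    }
    for (auto& w : workers) w.join();
    for (unsigned t = 0; t < num_threads; ++t)   // O(N) merge step
        for (size_t w = 0; w < bits.size(); ++w)
            bits[w] |= priv[t][w];
}
```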

Partitions of M

The basic idea here is to partition the indexes in M such that each worker thread can then work on a disjoint part of the N array. If M was sorted, that would be trivial, but it's not, so...

A simple algorithm that will work well if M is smoothly distributed is to first partition the values of M into T buckets, with the buckets holding values in the ranges [0, N/T), [N/T, 2N/T), ..., [(T-1)N/T, N). That is, divide N into T disjoint regions and then find the values of M that fall into each of them. You can spread that work across the T threads by assigning each thread an equal-size chunk of M, having each of them create the T partitions, and then logically merging[1] them at the end so you have the T partitions of M.

The second step is to actually set all the bits: you assign one partition to each of the T threads, which can set the bits in a "single threaded" way, i.e., without worrying about concurrent updates, since each thread is working on a disjoint partition of N[2].
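Here is a sketch of both steps, with step 1 shown single-threaded for clarity (parallelizing it as described above is straightforward); set_bits_partitioned and the word-aligned region rounding are illustrative choices, not prescribed by the question:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Partition-of-M sketch. Step 1 buckets the indices of M by which region of
// N they land in. Step 2 gives each thread one bucket, so all stores into N
// are plain, non-atomic writes to disjoint words.
void set_bits_partitioned(std::vector<uint64_t>& bits,
                          const std::vector<uint32_t>& m,
                          unsigned num_threads) {
    const uint64_t n = uint64_t(bits.size()) * 64;
    uint64_t region = (n + num_threads - 1) / num_threads;
    region = (region + 63) / 64 * 64;  // word-align regions (see footnote [2])

    // Step 1: partition the values of M into T buckets by target region.
    std::vector<std::vector<uint32_t>> buckets(num_threads);
    for (uint32_t idx : m) {
        const size_t b = std::min<size_t>(idx / region, num_threads - 1);
        buckets[b].push_back(idx);
    }

    // Step 2: one thread per bucket; no synchronization needed on N.
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&bits, &buckets, t] {
            for (uint32_t idx : buckets[t])
                bits[idx / 64] |= uint64_t(1) << (idx % 64);
        });
    for (auto& w : workers) w.join();
}
```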

Both steps together are O(M), and the second step is identical to the single-threaded case, so the overhead of parallelizing this lies in the first step. I suspect the first step will range from about the same speed as the second to perhaps 2-4 times as slow, depending on implementation and hardware, so you can expect a speedup on a machine with many cores, but with only 2 or 4 it might not be any better.

If the distribution of M is not smooth, such that the partitions created in the first step have very different sizes, it will work poorly because some threads will get a lot more work. A simple strategy is to create, say, 10 * T partitions rather than only T, and have the threads in the second pass all consume from the same queue of partitions until complete. In this way you spread the work more evenly, unless the array M is very bunched up. In that case you might consider a refinement of the first step which first creates a bucketed histogram of the elements, and then a reduce stage which looks at the combined histogram to create a good partitioning.
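A sketch of that queue-consumption idea, assuming the buckets were built over K = 10 * T disjoint, word-aligned regions of N (so plain stores remain safe); the atomic counter standing in for the queue is an illustrative choice:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Work-queue sketch for an uneven M: threads claim the next unprocessed
// bucket via an atomic counter, so a few oversized buckets don't leave most
// threads idle while one straggler finishes.
void consume_bucket_queue(std::vector<uint64_t>& bits,
                          const std::vector<std::vector<uint32_t>>& buckets,
                          unsigned num_threads) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&bits, &buckets, &next] {
            for (size_t b; (b = next.fetch_add(1)) < buckets.size(); )
                for (uint32_t idx : buckets[b])
                    bits[idx / 64] |= uint64_t(1) << (idx % 64);
        });
    for (auto& w : workers) w.join();
}
```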

Essentially, we are just progressively refining the first stage into a type of parallel sort/partitioning algorithm, for which there is already plenty of literature. You might even find that a full (parallel) sort is fastest, since it will greatly help the bit-setting phase: accesses will be in-order and have the best spatial locality (helping with prefetching and caching, respectively).


[0] ... and also from the "allocate a private array of length N" step, although this is likely to be quite fast.

[1] The conceptually simplest form of merging would be to simply copy each thread's partitions of M such that you have a contiguous partition of all of M, but in practice if the partitions are large you can just leave the partitions where they are and link them together, adding some complexity to the consuming code, but avoiding the compacting step.

[2] To make it truly disjoint from a threading point of view you want to ensure the partition of N falls on "byte boundaries", and perhaps even cache-line boundaries, to avoid false sharing (although the latter is likely not to be a big problem since it only occurs at the edge of each partition, and the order of processing means that you are not likely to get contention).

[4] In practice, the exact "order" of the baseline concurrent solution using shared N is hard to define, because there will be contention, so the O(M/T) scaling will break down for large enough T. If we assume N is quite large and T is limited to typical hardware concurrency of at most a dozen cores or so, it's probably an OK approximation.
