Efficient all-pairs set intersection on GPU

Problem Description

I have n sets, subsets of a finite universe. I want to calculate the n*n matrix in which the (I, J) entry contains the cardinality of the intersection of set I and set J. n is on the order of 50000.

My idea is to split the matrix into blocks sufficiently small so as to have one thread per entry. Every thread should calculate the intersection using a bitwise AND.
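For concreteness, a minimal sketch of that baseline, assuming each set is stored as a bitmask of 64-bit words over the universe (the kernel name and data layout below are illustrative, not taken from the question):

// Baseline sketch: one thread per (i, j) entry of the output matrix.
// Assumes each set is stored as a bitmask of `words` 64-bit words over the
// universe, laid out row-major in `sets` (n rows of `words` words each).
__global__ void pairwise_intersections(const unsigned long long *sets,
                                       int *out, int n, int words)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row: set I
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // col: set J
    if (i >= n || j >= n) return;

    int count = 0;
    for (int w = 0; w < words; ++w)
        count += __popcll(sets[(size_t)i * words + w] & sets[(size_t)j * words + w]);

    out[(size_t)i * n + j] = count;
}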

Are there more efficient approaches to solve this problem?

Recommended Answer

A different approach to breaking up the problem is to split the universe into smaller partitions of 1024 elements each, and compute just the size of the intersections within this part of the universe.

I'm not sure if I've described that well; basically you're computing

A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)

where v and w are the two lists of sets, and (k in S) is 1 if true and 0 otherwise. The point is to permute the indices so that k is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive k at once, rather than one at a time.

That is, you initialize the output matrix to all zeroes, and for each block of 1024 universe elements, you compute the sizes of the intersections and accumulate the results into the output matrix.
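In outline, the restructured computation looks something like the host-side sketch below (plain C++ rather than a kernel, just to show the loop order; the 1024-element chunk size and the layout are assumptions, not code from the answer):

#include <algorithm>
#include <bit>        // std::popcount (C++20)
#include <cstdint>
#include <vector>

// Host-side sketch of the loop restructuring only: the chunk loop over the
// universe is outermost, and each chunk's partial counts are accumulated
// into the (zero-initialized) output matrix A.
constexpr int WORDS_PER_CHUNK = 1024 / 64;   // 1024 universe elements per chunk

void all_pairs(const std::vector<std::vector<uint64_t>> &sets,  // n bitmasks
               std::vector<int> &A)                             // n*n, all zeroes
{
    const int n = static_cast<int>(sets.size());
    const int words = static_cast<int>(sets[0].size());

    for (int base = 0; base < words; base += WORDS_PER_CHUNK)   // one universe chunk
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                const int end = std::min(base + WORDS_PER_CHUNK, words);
                int partial = 0;
                for (int w = base; w < end; ++w)
                    partial += std::popcount(sets[i][w] & sets[j][w]);
                A[i * n + j] += partial;   // accumulate this chunk's contribution
            }
}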

I chose 1024 because I imagine you'll have a data layout where that's probably the smallest size at which you can still get the full memory bandwidth when reading from device memory, and all of the threads in a warp work together. (Adjust this appropriately if you know better than me, or if you aren't using nVidia and whatever other GPU you're using would work better with something else.)

Now that your elements are a reasonable size, you can appeal to traditional linear algebra optimizations to compute this product. I would probably do the following:

Each warp is assigned a large number of rows of the output matrix. It reads the corresponding elements out of the second vector, and then iterates through the first vector, computing products.

You could have all of the warps operate independently, but it may be better to do the following:


  • All of the warps in a block work together to load a number of elements from the first vector

  • Each warp computes the intersections it can and writes the results into the output matrix

You could store the loaded elements in shared memory, but you might get better results holding them in registers. Each warp can only compute the intersections with the set elements it is holding onto, but after doing so the warps can all rotate which warps are holding which elements.
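A minimal sketch of the shared-memory variant, assuming one kernel launch per 1024-bit chunk of the universe, a per-chunk layout arranged by the host (word w of set r's slice at chunks[r * WORDS_PER_CHUNK + w]), and that rows and columns both come from the same list of n sets as in the question; the tile size and names are illustrative:

constexpr int WORDS_PER_CHUNK = 1024 / 64;   // 16 x 64-bit words = 1024 universe elements
constexpr int TILE_ROWS = 32;                // output rows owned by one block

// One launch processes a single 1024-element chunk: each block stages the
// chunk slices of TILE_ROWS sets in shared memory, then every thread sweeps
// the columns and accumulates popcounts into the output matrix.
__global__ void accumulate_chunk(const unsigned long long *chunks,  // n * WORDS_PER_CHUNK words
                                 int *out, int n)
{
    __shared__ unsigned long long tile[TILE_ROWS * WORDS_PER_CHUNK];
    const int row0 = blockIdx.x * TILE_ROWS;

    // Cooperative load of this block's rows for the current chunk.
    for (int t = threadIdx.x; t < TILE_ROWS * WORDS_PER_CHUNK; t += blockDim.x) {
        const int r = row0 + t / WORDS_PER_CHUNK;
        tile[t] = (r < n) ? chunks[(size_t)r * WORDS_PER_CHUNK + t % WORDS_PER_CHUNK] : 0ULL;
    }
    __syncthreads();

    // Each thread walks a strided subset of the columns.
    for (int j = threadIdx.x; j < n; j += blockDim.x) {
        unsigned long long col[WORDS_PER_CHUNK];
        for (int w = 0; w < WORDS_PER_CHUNK; ++w)
            col[w] = chunks[(size_t)j * WORDS_PER_CHUNK + w];

        for (int r = 0; r < TILE_ROWS && row0 + r < n; ++r) {
            int cnt = 0;
            for (int w = 0; w < WORDS_PER_CHUNK; ++w)
                cnt += __popcll(tile[r * WORDS_PER_CHUNK + w] & col[w]);
            out[(size_t)(row0 + r) * n + j] += cnt;   // only this block writes these rows
        }
    }
}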

If you do enough optimizations along these lines, you will probably reach the point where you are no longer memory bound, which means you might not have to go so far as to do the most complicated optimizations (e.g. the shared memory approach described above might already be enough).
