CUDA stream compaction algorithm


Problem Description

I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0's, with or without keeping the order.

Example:

Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}

Host Memory Result: {17, 13, 14, 0, 0, ...}

The simplest way is to use the host to remove the 0's in O(n) time. But considering I have around 1000 elements, it will probably be faster to leave everything on the GPU and condense it first, before sending it.
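
For comparison, a minimal sketch of that host-side O(n) removal, assuming the data sits in a std::vector (the helper name hostCompact is mine, not from the question):

#include <algorithm>
#include <vector>

// Drop all zeros in a single linear pass on the host (erase-remove idiom).
std::vector<int> hostCompact(std::vector<int> v) {
    v.erase(std::remove(v.begin(), v.end(), 0), v.end());
    return v;
}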

The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.

An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:

__global__ void kernalRemoveSpacing(int *array, int *outArray, int arraySize) {
    // Threads that hold a zero have nothing to contribute.
    if (array[threadIdx.x] == 0)
        return;

    // Every surviving thread races to claim each output slot in turn.
    for (int i = 0; i < arraySize; i++) {

        outArray[i] = array[threadIdx.x];

        __threadfence();

        // If we were the lucky thread we won!
        // kill the thread and continue re-reincarnated in a different thread
        if (outArray[i] == array[threadIdx.x])
            return;
    }
}

The only benefit of this method is that we would perform in O(f(x)) time, where f(x) is the average number of non-zero values in the array (f(x) ~= ln(n) for my implementation, thus O(ln(n)) time, but with a large constant factor).

Finally, a sort algorithm such as quicksort or mergesort would also solve the problem, and does in fact run in O(ln(n)) relative time. I think there might be an even faster algorithm, since we do not need to waste time ordering (swapping) zero-zero element pairs and non-zero/non-zero element pairs (the order does not need to be kept).

So I'm not quite sure which method would be the fastest, and I still think there's a better way of handling this. Any suggestions?

Recommended Answer

What you are asking for is a classic parallel algorithm called stream compaction (1).

If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm; it preserves the relative order of all elements.

Sketch:

#include <thrust/copy.h>

template<typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here

thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());
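
A possible end-to-end usage of the sketch, assuming the data lives in thrust::device_vectors (the variable names and the resize step are my illustration, not part of the original answer). thrust::copy_if returns an iterator one past the last element written, which gives the compacted length:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// is_non_zero<T> as defined in the sketch above.

int main() {
    int data[] = {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13};
    thrust::device_vector<int> input(data, data + 13);
    thrust::device_vector<int> output(input.size());

    // copy_if returns an iterator one past the last copied element,
    // so new_end - output.begin() is the number of non-zero values.
    auto new_end = thrust::copy_if(input.begin(), input.end(),
                                   output.begin(), is_non_zero<int>());
    output.resize(new_end - output.begin());

    thrust::host_vector<int> result = output;  // {14, 17, 13}
    return 0;
}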

If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It's a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.

(1) Strictly speaking, it's not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm, but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?
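
For illustration only, a minimal sketch of what such a hand-rolled, non-stable compaction could look like, using an atomic counter to hand out output slots (this kernel and its names are my own assumption, not part of the original answer):

#include <cuda_runtime.h>

// Non-stable compaction sketch: outCount must be zeroed before launch.
// Each thread holding a non-zero value reserves the next free output slot,
// so the output order is arbitrary, as the question allows.
__global__ void compactNonStable(const int *in, int *out, int *outCount, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0) {
        int slot = atomicAdd(outCount, 1);  // claim one output position
        out[slot] = in[i];
    }
}

After the kernel finishes, copying outCount back to the host gives the compacted length; a prefix-sum (scan) based variant would additionally preserve the input order, at the cost of an extra pass.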
