如何生成随机排列与CUDA [英] How to generate random permutations with CUDA

查看：216 发布时间：2015/11/30 15:07:30 c++ algorithm cuda thrust

本文介绍了如何生成随机排列与CUDA的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我能使用什么样的并行算法从一组给定生成随机排列？特别是提案或链接到适合CUDA论文将是有益的。

What parallel algorithms could I use to generate random permutations from a given set? Especially proposals or links to papers suitable for CUDA would be helpful.

这方面的一个顺序版本将是费雪耶茨洗牌。

A sequential version of this would be the Fisher-Yates shuffle.

例如：

令S = {1，2，...，7}是集合源索引。的目标是产生n个随机排列平行。上述n个排列的包含各源索引正好一次，例如{7,6，...，1}

Let S={1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.

推荐答案

费雪耶茨洗牌，可以并行。例如，4个并发的工人只需要3次迭代洗牌的8个元素的矢量。关于第一迭代它们互换0℃ - > 1，第2版; - > 3，4℃ - > 5，第6版; - > 7;关于第二迭代0℃ - > 2，1＆其中; - > 3，4℃ - > 5，第6版; - > 7;和最后一次迭代0℃ - > 4,1＆其中; - > 5，第2版; - > 6，第3版; - > 7

Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle vector of 8 elements. On first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on second iteration 0<->2, 1<->3, 4<->5, 6<->7; and on last iteration 0<->4, 1<->5, 2<->6, 3<->7.

这可以很容易地实现为CUDA __ __设备 code（通过标准的启发<一href="http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf"相对=nofollow>最小值/最大值减少）：

This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):

const int id  = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id]     = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;
unsigned int pos = id * 2;  
while(shift <= BLOCK_SIZE)
{
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}

下面的curand初始化code被省略，和方法掉期（INT * P，诠释我，诠释J）交流值点[我] 和 P [J] 。

Here the curand initialization code is omitted, and method swap(int *p, int i, int j) exchanges values p[i] and p[j].

请注意，code以上具有以下假设：

Note that the code above has the following assumptions:

置换的长度为2 * BLOCK_SIZE，其中BLOCK_SIZE是2的幂。
2 * BLOCK_SIZE整数放入 __ CUDA设备共享__ 内存
在BLOCK_SIZE是CUDA块（通常在32到512的东西）

The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
2 * BLOCK_SIZE integers fit into __shared__ memory of CUDA device
BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)

要生成多个置换我会建议使用不同的CUDA块。如果目标是使7元置换（因为它是提到了原来的问题），那么我相信这将是更快地做它在单个线程。

To generate more than one permutation I would suggest to utilize different CUDA blocks. If the goal is to make permutation of 7 elements (as it was mentioned in the original question) then I believe it will be faster to do it in single thread.

这篇关于如何生成随机排列与CUDA的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何生成随机排列与CUDA [英] How to generate random permutations with CUDA

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

如何生成随机排列与CUDA [英] How to generate random permutations with CUDA

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭