如何生成随机排列与CUDA [英] How to generate random permutations with CUDA

查看:216
本文介绍了如何生成随机排列与CUDA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我能使用什么样的并行算法从一组给定生成随机排列? 特别是提案或链接到适合CUDA论文将是有益的。

What parallel algorithms could I use to generate random permutations from a given set? Especially proposals or links to papers suitable for CUDA would be helpful.

这方面的一个顺序版本将是费雪耶茨洗牌。

A sequential version of this would be the Fisher-Yates shuffle.

例如:

令S = {1,2,...,7}是集合源索引。 的目标是产生n个随机排列平行。 上述n个排列的包含各源索引正好一次, 例如{7,6,...,1}

Let S={1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.

推荐答案

费雪耶茨洗牌,可以并行。例如,4个并发的工人只需要3次迭代洗牌的8个元素的矢量。关于第一迭代它们互换0℃ - > 1,第2版; - > 3,4℃ - > 5,第6版; - > 7;关于第二迭代0℃ - > 2,1&其中; - > 3,4℃ - > 5,第6版; - > 7;和最后一次迭代0℃ - > 4,1&其中; - > 5,第2版; - > 6,第3版; - > 7

Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle vector of 8 elements. On first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on second iteration 0<->2, 1<->3, 4<->5, 6<->7; and on last iteration 0<->4, 1<->5, 2<->6, 3<->7.

这可以很容易地实现为CUDA __ __设备 code(通过标准的启发<一href="http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf"相对=nofollow>最小值/最大值减少):

This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):

const int id  = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id]     = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;
unsigned int pos = id * 2;  
while(shift <= BLOCK_SIZE)
{
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}

下面的curand初始化code被省略,和方法掉期(INT * P,诠释我,诠释J)交流值点[我] P [J]

Here the curand initialization code is omitted, and method swap(int *p, int i, int j) exchanges values p[i] and p[j].

请注意,code以上具有以下假设:

Note that the code above has the following assumptions:

  1. 置换的长度为2 * BLOCK_SIZE,其中BLOCK_SIZE是2的幂。
  2. 2 * BLOCK_SIZE整数放入 __ CUDA设备共享__ 内存
  3. 在BLOCK_SIZE是CUDA块(通常在32到512的东西)
  4. 的有效大小
  1. The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
  2. 2 * BLOCK_SIZE integers fit into __shared__ memory of CUDA device
  3. BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)

要生成多个置换我会建议使用不同的CUDA块。如果目标是使7元置换(因为它是提到了原来的问题),那么我相信这将是更快地做它在单个线程。

To generate more than one permutation I would suggest to utilize different CUDA blocks. If the goal is to make permutation of 7 elements (as it was mentioned in the original question) then I believe it will be faster to do it in single thread.

这篇关于如何生成随机排列与CUDA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆