How to generate random permutations with CUDA


Problem Description

What parallel algorithms could I use to generate random permutations from a given set? Proposals or links to papers suitable for CUDA would be especially helpful.

A sequential version of this would be the Fisher-Yates shuffle.

Example:

Let S = {1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.

Answer

The Fisher-Yates shuffle can be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. In the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; in the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and in the last iteration 0<->4, 1<->5, 2<->6, 3<->7.
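The position update that produces this pairing pattern can be checked in isolation (a host-side C++ sketch; the function name is mine, not from the answer):

```cpp
// Butterfly index remapping used in the kernel below: after the stride is
// doubled to `shift`, a worker whose left element sat at `pos` moves to
// (pos & ~shift) | ((pos & shift) >> 1), so that the pairs
// (pos, pos + shift) again partition all 2 * BLOCK_SIZE slots.
unsigned int next_pos(unsigned int pos, unsigned int shift)
{
    return (pos & ~shift) | ((pos & shift) >> 1);
}
```

For 4 workers starting at positions 0, 2, 4, 6, applying this remap with shift = 2 gives positions 0, 1, 4, 5 (pairs 0<->2, 1<->3, 4<->6, 5<->7), and with shift = 4 gives 0, 1, 2, 3 (pairs 0<->4, 1<->5, 2<->6, 3<->7).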

This can easily be implemented as CUDA __device__ code (inspired by the standard min/max reduction):

const int id = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];

// Start from the identity permutation; each thread owns two slots.
perm_shared[2 * id]     = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;     // distance between the two elements a thread may swap
unsigned int pos = id * 2;  // index of this thread's left element
while (shift <= BLOCK_SIZE)
{
    // Swap this thread's pair with probability 1/2, then double the stride.
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    // Remap this thread's position so the pairs partition the array
    // at the new stride (butterfly pattern, as in min/max reduction).
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}

Here the curand initialization code is omitted, and the function swap(int *p, int i, int j) exchanges the values p[i] and p[j].

Note that the code above relies on the following assumptions:

  1. The permutation length is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
  2. 2 * BLOCK_SIZE integers fit into the __shared__ memory of the CUDA device.
  3. BLOCK_SIZE is a valid CUDA block size (usually between 32 and 512).

To generate more than one permutation, I would suggest utilizing different CUDA blocks. If the goal is to permute only 7 elements (as mentioned in the original question), then I believe it would be faster to do that in a single thread.
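Putting the pieces together, a one-block-per-permutation kernel might look like the sketch below. This is my own assembly of the answer's fragments under stated assumptions: the swap helper, the curand seeding via curand_init, the kernel name, and the output layout are not part of the original answer.

```cuda
#include <curand_kernel.h>

#define BLOCK_SIZE 128  // assumed; permutation length = 2 * BLOCK_SIZE

// Helper assumed by the answer: exchanges values p[i] and p[j].
__device__ void swap(int *p, int i, int j)
{
    int t = p[i];
    p[i] = p[j];
    p[j] = t;
}

// One block per permutation: block b writes a random permutation of
// {0, ..., 2 * BLOCK_SIZE - 1} to out[b * 2 * BLOCK_SIZE ...].
__global__ void shuffle_permutations(unsigned long long seed, int *out)
{
    const int id = threadIdx.x;

    // One independent generator per thread; the subsequence argument
    // keeps the per-thread random streams distinct.
    curandState curand_state;
    curand_init(seed, blockIdx.x * blockDim.x + id, 0, &curand_state);

    __shared__ int perm_shared[2 * BLOCK_SIZE];
    perm_shared[2 * id]     = 2 * id;
    perm_shared[2 * id + 1] = 2 * id + 1;
    __syncthreads();

    unsigned int shift = 1;
    unsigned int pos = id * 2;
    while (shift <= BLOCK_SIZE)
    {
        if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
        shift = shift << 1;
        pos = (pos & ~shift) | ((pos & shift) >> 1);
        __syncthreads();
    }

    // Copy this block's finished permutation to global memory.
    out[blockIdx.x * 2 * BLOCK_SIZE + 2 * id]     = perm_shared[2 * id];
    out[blockIdx.x * 2 * BLOCK_SIZE + 2 * id + 1] = perm_shared[2 * id + 1];
}
```

Launched as `shuffle_permutations<<<n, BLOCK_SIZE>>>(seed, d_out)` with `d_out` holding n * 2 * BLOCK_SIZE ints, each block produces one permutation.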

