How to generate random permutations with CUDA


Problem Description

What parallel algorithms could I use to generate random permutations from a given set? Proposals or links to papers suitable for CUDA would be especially helpful.

A sequential version of this would be the Fisher-Yates shuffle.

Example:

Let S = {1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.

Answer

The Fisher-Yates shuffle can be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. In the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; in the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and in the last iteration 0<->4, 1<->5, 2<->6, 3<->7.
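The position update that produces this pairing pattern can be checked in isolation (a host-side C++ sketch; the function name is mine, not from the answer):

```cpp
// Butterfly index remapping used in the kernel below: after the stride is
// doubled to `shift`, a worker whose left element sat at `pos` moves to
// (pos & ~shift) | ((pos & shift) >> 1), so that the pairs
// (pos, pos + shift) again partition all 2 * BLOCK_SIZE slots.
unsigned int next_pos(unsigned int pos, unsigned int shift)
{
    return (pos & ~shift) | ((pos & shift) >> 1);
}
```

For 4 workers starting at positions 0, 2, 4, 6, applying this remap with shift = 2 gives positions 0, 1, 4, 5 (pairs 0<->2, 1<->3, 4<->6, 5<->7), and with shift = 4 gives 0, 1, 2, 3 (pairs 0<->4, 1<->5, 2<->6, 3<->7).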

This can easily be implemented as CUDA __device__ code (inspired by the standard min/max reduction):

const int id = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];

// Start from the identity permutation; each thread owns two slots.
perm_shared[2 * id]     = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;     // distance between the two elements a thread may swap
unsigned int pos = id * 2;  // index of this thread's left element
while (shift <= BLOCK_SIZE)
{
    // Swap this thread's pair with probability 1/2, then double the stride.
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    // Remap this thread's position so the pairs partition the array
    // at the new stride (butterfly pattern, as in min/max reduction).
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}

Here the curand initialization code is omitted, and the function swap(int *p, int i, int j) exchanges the values p[i] and p[j].

Note that the code above relies on the following assumptions:

  1. The permutation length is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
  2. 2 * BLOCK_SIZE integers fit into the __shared__ memory of the CUDA device.
  3. BLOCK_SIZE is a valid CUDA block size (usually between 32 and 512).

To generate more than one permutation, I would suggest utilizing different CUDA blocks. If the goal is to permute only 7 elements (as mentioned in the original question), then I believe it would be faster to do that in a single thread.
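Putting the pieces together, a one-block-per-permutation kernel might look like the sketch below. This is my own assembly of the answer's fragments under stated assumptions: the swap helper, the curand seeding via curand_init, the kernel name, and the output layout are not part of the original answer.

```cuda
#include <curand_kernel.h>

#define BLOCK_SIZE 128  // assumed; permutation length = 2 * BLOCK_SIZE

// Helper assumed by the answer: exchanges values p[i] and p[j].
__device__ void swap(int *p, int i, int j)
{
    int t = p[i];
    p[i] = p[j];
    p[j] = t;
}

// One block per permutation: block b writes a random permutation of
// {0, ..., 2 * BLOCK_SIZE - 1} to out[b * 2 * BLOCK_SIZE ...].
__global__ void shuffle_permutations(unsigned long long seed, int *out)
{
    const int id = threadIdx.x;

    // One independent generator per thread; the subsequence argument
    // keeps the per-thread random streams distinct.
    curandState curand_state;
    curand_init(seed, blockIdx.x * blockDim.x + id, 0, &curand_state);

    __shared__ int perm_shared[2 * BLOCK_SIZE];
    perm_shared[2 * id]     = 2 * id;
    perm_shared[2 * id + 1] = 2 * id + 1;
    __syncthreads();

    unsigned int shift = 1;
    unsigned int pos = id * 2;
    while (shift <= BLOCK_SIZE)
    {
        if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
        shift = shift << 1;
        pos = (pos & ~shift) | ((pos & shift) >> 1);
        __syncthreads();
    }

    // Copy this block's finished permutation to global memory.
    out[blockIdx.x * 2 * BLOCK_SIZE + 2 * id]     = perm_shared[2 * id];
    out[blockIdx.x * 2 * BLOCK_SIZE + 2 * id + 1] = perm_shared[2 * id + 1];
}
```

Launched as `shuffle_permutations<<<n, BLOCK_SIZE>>>(seed, d_out)` with `d_out` holding n * 2 * BLOCK_SIZE ints, each block produces one permutation.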

