如何生成随机排列与CUDA [英] How to generate random permutations with CUDA
问题描述
我能使用什么样的并行算法从一组给定生成随机排列? 特别是提案或链接到适合CUDA论文将是有益的。
What parallel algorithms could I use to generate random permutations from a given set? Especially proposals or links to papers suitable for CUDA would be helpful.
这方面的一个顺序版本将是费雪耶茨洗牌。
A sequential version of this would be the Fisher-Yates shuffle.
例如:
令S = {1,2,...,7}是集合源索引。 的目标是产生n个随机排列平行。 上述n个排列的包含各源索引正好一次, 例如{7,6,...,1}
Let S={1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.
推荐答案
费雪耶茨洗牌,可以并行。例如,4个并发的工人只需要3次迭代洗牌的8个元素的矢量。关于第一迭代它们互换0℃ - > 1,第2版; - > 3,4℃ - > 5,第6版; - > 7;关于第二迭代0℃ - > 2,1&其中; - > 3,4℃ - > 5,第6版; - > 7;和最后一次迭代0℃ - > 4,1&其中; - > 5,第2版; - > 6,第3版; - > 7
Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle vector of 8 elements. On first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on second iteration 0<->2, 1<->3, 4<->5, 6<->7; and on last iteration 0<->4, 1<->5, 2<->6, 3<->7.
这可以很容易地实现为CUDA __ __设备
code(通过标准的启发<一href="http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf"相对=nofollow>最小值/最大值减少):
This could be easily implemented as CUDA __device__
code (inspired by standard min/max reduction):
const int id = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id] = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();
unsigned int shift = 1;
unsigned int pos = id * 2;
while(shift <= BLOCK_SIZE)
{
if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
shift = shift << 1;
pos = (pos & ~shift) | ((pos & shift) >> 1);
__syncthreads();
}
下面的curand初始化code被省略,和方法掉期(INT * P,诠释我,诠释J)
交流值点[我]
和 P [J]
。
Here the curand initialization code is omitted, and method swap(int *p, int i, int j)
exchanges values p[i]
and p[j]
.
请注意,code以上具有以下假设:
Note that the code above has the following assumptions:
- 置换的长度为2 * BLOCK_SIZE,其中BLOCK_SIZE是2的幂。
- 2 * BLOCK_SIZE整数放入
__ CUDA设备共享__
内存 - 在BLOCK_SIZE是CUDA块(通常在32到512的东西) 的有效大小
- The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
- 2 * BLOCK_SIZE integers fit into
__shared__
memory of CUDA device - BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)
要生成多个置换我会建议使用不同的CUDA块。如果目标是使7元置换(因为它是提到了原来的问题),那么我相信这将是更快地做它在单个线程。
To generate more than one permutation I would suggest to utilize different CUDA blocks. If the goal is to make permutation of 7 elements (as it was mentioned in the original question) then I believe it will be faster to do it in single thread.
这篇关于如何生成随机排列与CUDA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!