How to generate random numbers inside a pyCUDA kernel?

Question

I am using pyCUDA for CUDA programming. I need to use random numbers inside a kernel function. The CURAND library cannot be used from within it (pyCUDA). Since there is a lot of work to be done on the GPU, generating the random numbers on the CPU and then transferring them to the GPU won't work; rather, it would defeat the purpose of using the GPU at all.

Additional questions:


  1. Is it possible to allocate memory on the GPU using 1 block and 1 thread?

  2. I am using more than one kernel. Do I need to use multiple SourceModule blocks?


Answer

Despite what you assert in your question, PyCUDA has pretty comprehensive support for CUrand. The GPUArray module has a direct interface to fill device memory using the host-side API (note that the random number generators still run on the GPU in this case).
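As a minimal sketch of that host-side route (assuming the pycuda.curandom convenience wrapper, which returns a GPUArray filled by a generator running on the device), no kernel code is needed at all:

import numpy as np
import pycuda.autoinit
from pycuda import curandom

# Fill a 1000-element device array with uniform floats; the generator
# itself runs on the GPU, only the API calls come from the host.
a_gpu = curandom.rand((1000,), dtype=np.float32)
print(a_gpu.get()[:10])   # copy back to the host for inspection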

It is also perfectly possible to use the device-side API from CUrand in PyCUDA kernel code. In that use case, the trickiest part is allocating memory for the per-thread generator states. There are three choices: statically in code, dynamically using host-side memory allocation, and dynamically using device-side memory allocation. The following (very lightly tested) example illustrates the latter, seeing as you asked about it in your question:

import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

code = """
    #include <curand_kernel.h>

    const int nstates = %(NGENERATORS)s;
    __device__ curandState_t* states[nstates];

    // Run once: allocate one generator state per thread on the device
    // heap and initialize it from the seed and the thread index.
    __global__ void initkernel(int seed)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curandState_t* s = new curandState_t;
            if (s != 0) {
                curand_init(seed, tidx, 0, s);
            }

            states[tidx] = s;
        }
    }

    // Fill values with uniform randoms in a grid-stride loop, then
    // write the advanced state back so that subsequent launches
    // continue the sequence rather than repeating it.
    __global__ void randfillkernel(float *values, int N)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curandState_t s = *states[tidx];
            for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                values[i] = curand_uniform(&s);
            }
            *states[tidx] = s;
        }
    }
"""

N = 1024
# no_extern_c=True is needed because curand_kernel.h contains C++;
# adjust arch to match your own GPU
mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True, arch="sm_52")
# with C++ linkage the kernels must be fetched by their mangled names
init_func = mod.get_function("_Z10initkerneli")
fill_func = mod.get_function("_Z14randfillkernelPfi")

seed = np.int32(123456789)
nvalues = 10 * N
init_func(seed, block=(N,1,1), grid=(1,1,1))
gdata = gpuarray.zeros(nvalues, dtype=np.float32)
fill_func(gdata, np.int32(nvalues), block=(N,1,1), grid=(1,1,1))

Here there is an initialization kernel, which needs to be run once to allocate memory for the generator states and initialize them with the seed, and then a kernel which uses those states. You will need to be mindful of malloc heap size limits if you want to run a lot of threads, but those can be manipulated via the PyCUDA driver API interface.
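As a minimal sketch of that last point (assuming pycuda.driver exposes cuCtxSetLimit as Context.set_limit; the limit must be raised before the first kernel that calls new or malloc on the device):

import pycuda.autoinit
import pycuda.driver as drv

# Enlarge the device malloc heap (the default is 8 MB) to leave room
# for many curandState_t allocations; do this before launching initkernel.
drv.Context.set_limit(drv.limit.MALLOC_HEAP_SIZE, 64 * 1024 * 1024)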
