How to generate random number inside pyCUDA kernel?


Question

I am using PyCUDA for CUDA programming. I need to use random numbers inside a kernel function. The CURAND library doesn't work inside it (PyCUDA). Since there is a lot of work to be done on the GPU, generating the random numbers on the CPU and then transferring them to the GPU won't work; rather, it would defeat the purpose of using the GPU at all.

Supplementary questions:

  1. Is there any way to allocate memory on the GPU using 1 block and 1 thread?
  2. I am using more than one kernel. Do I need to use multiple SourceModule blocks?

Answer

Despite what you assert in your question, PyCUDA has pretty comprehensive support for CURAND. The GPUArray module has a direct interface to fill device memory using the host-side API (note that the random generators still run on the GPU in this case).
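As an illustration of that host-side route, here is a minimal sketch using the pycuda.curandom wrappers (the choice of generator class and the shape/dtype are illustrative, not part of the original answer):

import numpy as np
import pycuda.autoinit
from pycuda import curandom

# The generator object lives on the host, but the samples are produced
# directly into device memory by CURAND kernels running on the GPU.
gen = curandom.XORWOWRandomNumberGenerator()
samples = gen.gen_uniform((1024,), dtype=np.float32)  # returns a GPUArray
print(samples.get()[:8])  # copy back to the host only for inspection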

It is also perfectly possible to use the device-side API from CURAND in PyCUDA kernel code. In this use case the trickiest part is allocating memory for the thread generator states. There are three choices: statically in code, dynamically using host-side memory allocation, and dynamically using device-side memory allocation. The following (very lightly tested) example illustrates the latter, seeing as you asked about it in your question:

import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

code = """
    #include <curand_kernel.h>

    const int nstates = %(NGENERATORS)s;

    // One generator state pointer per thread, held in device global memory
    __device__ curandState_t* states[nstates];

    __global__ void initkernel(int seed)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            // Allocate this thread's state dynamically on the device heap
            curandState_t* s = new curandState_t;
            if (s != 0) {
                curand_init(seed, tidx, 0, s);
            }

            states[tidx] = s;
        }
    }

    __global__ void randfillkernel(float *values, int N)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            // Work on a local copy of the state and store it back afterwards,
            // so the sequence continues correctly across kernel launches
            curandState_t s = *states[tidx];
            for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                values[i] = curand_uniform(&s);
            }
            *states[tidx] = s;
        }
    }
"""

N = 1024
# no_extern_c=True is required because curand_kernel.h must be compiled as
# C++, so the kernel names below are C++ mangled; adjust arch for your GPU
mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True, arch="sm_52")
init_func = mod.get_function("_Z10initkerneli")
fill_func = mod.get_function("_Z14randfillkernelPfi")

seed = np.int32(123456789)
nvalues = 10 * N
# Run the initialization kernel once; the states can then be reused by
# any number of subsequent generation kernel launches
init_func(seed, block=(N,1,1), grid=(1,1,1))
gdata = gpuarray.zeros(nvalues, dtype=np.float32)
fill_func(gdata, np.int32(nvalues), block=(N,1,1), grid=(1,1,1))

Here there is an initialization kernel which needs to be run once to allocate memory for the generator states and initialize them with the seed, and then a kernel which uses those states. You will need to be mindful of malloc heap size limits if you want to run a lot of threads, but those can be manipulated via the PyCUDA driver API interface.
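As a sketch of that last point, the heap limit can be raised through pycuda.driver before the first kernel that uses device-side new is launched (the 128 MB figure below is purely illustrative, not a recommendation):

import pycuda.autoinit
import pycuda.driver as drv

# Enlarge the device malloc heap so that many threads can each allocate
# a curandState_t with device-side new; 128 MB is an arbitrary example.
drv.Context.set_limit(drv.limit.MALLOC_HEAP_SIZE, 128 * 1024 * 1024)
print(drv.Context.get_limit(drv.limit.MALLOC_HEAP_SIZE))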
