Generate random number within a function with cuRAND without preallocation


Question

    Is it possible to generate random numbers within a device function without preallocating all the states? I would like to generate and use them in "real time". I need them for Monte Carlo simulations; what is most suitable for this purpose? The numbers generated below are single precision; is it possible to get them in double precision?

    #include <cstdio>   // printf
    #include <ctime>    // time
    #include <iostream>
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <curand_kernel.h>
    
    __global__ void cudaRand(float *d_out, unsigned long seed)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        curandState state;
        curand_init(seed, i, 0, &state);
        d_out[i] = curand_uniform(&state);
    }
    
    int main(int argc, char** argv)
    {
        size_t N = 1 << 4;
        float *v = new float[N];
    
        float *d_out;
        cudaMalloc((void**)&d_out, N * sizeof(float));
    
        // generate random numbers
        cudaRand<<<1, N>>>(d_out, time(NULL));
    
        cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    
        for (size_t i = 0; i < N; i++)
        {
            printf("out: %f\n", v[i]);
        }
    
        cudaFree(d_out);
        delete[] v;
    
        return 0;
    }
    


    UPDATE

    #include <cstdio>   // printf
    #include <iostream>
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <curand_kernel.h>
    #include <ctime>
    
    __global__ void cudaRand(double *d_out)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        curandState state;
        curand_init((unsigned long long)clock() + i, 0, 0, &state);
    
        d_out[i] = curand_uniform_double(&state);
    }
    
    int main(int argc, char** argv)
    {
        size_t N = 1 << 4;
        double *h_v = new double[N];
    
        double *d_out;
        cudaMalloc((void**)&d_out, N * sizeof(double));
    
        // generate random numbers
        cudaRand<<<1, N>>>(d_out);
    
        cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
    
        for (size_t i = 0; i < N; i++)
            printf("out: %f\n", h_v[i]);
    
        cudaFree(d_out);
        delete[] h_v;
    
        return 0;
    }
    

    Answer

    Here is how I dealt with a similar situation in the past, within a __device__/__global__ function:

    int tId = threadIdx.x + (blockIdx.x * blockDim.x);
    curandState state;
    curand_init((unsigned long long)clock() + tId, 0, 0, &state);
    
    double rand1 = curand_uniform_double(&state);
    double rand2 = curand_uniform_double(&state);
    

    So just use curand_uniform_double to generate random doubles. I also believe you don't want the same seed for all of the threads; that is what I am trying to achieve by using clock() + tId instead. This way the odds of getting the same rand1/rand2 in any two threads are close to nil.
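As a minimal sketch of how this per-thread state might be used in a Monte Carlo kernel, the example below estimates pi. The kernel name `estimatePi`, the host helper `runEstimatePi`, the launch sizes and the sample count are hypothetical and not part of the original answer. Each thread builds its own `curandState` inside the kernel, so no array of states is allocated beforehand.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread creates its generator state on the fly, draws `samples` points
// in the unit square and counts those that fall inside the quarter circle.
__global__ void estimatePi(unsigned int *d_hits, int samples)
{
    int tId = threadIdx.x + blockIdx.x * blockDim.x;

    curandState state;
    // Seeding as in the answer: a different seed per thread, subsequence 0, offset 0.
    curand_init((unsigned long long)clock() + tId, 0, 0, &state);

    unsigned int hits = 0;
    for (int s = 0; s < samples; ++s) {
        double x = curand_uniform_double(&state);  // uniform in (0, 1]
        double y = curand_uniform_double(&state);
        if (x * x + y * y <= 1.0) ++hits;
    }
    d_hits[tId] = hits;
}

// Hypothetical host helper: launches the kernel and sums the counts on the CPU.
double runEstimatePi(int blocks, int threads, int samples)
{
    int n = blocks * threads;
    unsigned int *d_hits = nullptr;
    cudaMalloc((void**)&d_hits, n * sizeof(unsigned int));

    estimatePi<<<blocks, threads>>>(d_hits, samples);

    std::vector<unsigned int> h_hits(n);
    cudaMemcpy(h_hits.data(), d_hits, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_hits);

    unsigned long long total = 0;
    for (unsigned int h : h_hits) total += h;
    return 4.0 * (double)total / ((double)n * samples);
}

int main()
{
    printf("pi is roughly %f\n", runEstimatePi(64, 128, 1000));
    return 0;
}
```

Because the seed comes from `clock()`, the printed estimate changes from run to run; pass a fixed seed from the host instead if reproducible results are needed.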

    EDIT:

    However, based on the comments below, the proposed approach may lead to biased results:

    • JackOLantern pointed me to this part of the cuRAND documentation:

      Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.

    • There is also a devtalk thread devoted to improving the performance of curand_init, in which the proposed way to speed up cuRAND initialization is:

      One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.

      But the same poster later states:

      The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.

    So it is basically up to you whether you want better performance (as I did) or 1000% unbiased results. If the latter is what you want, then the solution proposed by JackOLantern is the correct one, i.e. initialize cuRAND as:

    curand_init((unsigned long long)clock(), tId, 0, &state)
    

    Using non-zero values for the offset and subsequence parameters does, however, decrease performance. For more information on these parameters, see this SO thread and the cuRAND documentation.
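To make the two initialisation styles concrete, here is a small sketch that puts them side by side. The kernel `drawOne`, the helper `draw` and the `perThreadSeed` flag are hypothetical names. The sketch also assumes a seed supplied by the host rather than `clock()`, which is a deliberate substitution so that runs are reproducible.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// curand_init(seed, subsequence, offset, &state)
//   perThreadSeed == true : a different seed per thread, subsequence 0
//   perThreadSeed == false: one shared seed, subsequence = thread index
__global__ void drawOne(double *d_out, unsigned long long seed, bool perThreadSeed)
{
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    curandState state;
    if (perThreadSeed)
        curand_init(seed + tId, 0, 0, &state);
    else
        curand_init(seed, tId, 0, &state);
    d_out[tId] = curand_uniform_double(&state);
}

// Hypothetical host helper: one block of n threads (n must not exceed the block size limit).
std::vector<double> draw(int n, unsigned long long seed, bool perThreadSeed)
{
    double *d_out = nullptr;
    cudaMalloc((void**)&d_out, n * sizeof(double));
    drawOne<<<1, n>>>(d_out, seed, perThreadSeed);

    std::vector<double> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<double> bySeed = draw(16, 1234ULL, true);
    std::vector<double> bySubsequence = draw(16, 1234ULL, false);
    for (int i = 0; i < 4; ++i)
        printf("thread %d: per-thread seed %f, per-thread subsequence %f\n",
               i, bySeed[i], bySubsequence[i]);
    return 0;
}
```

With a fixed seed, both variants return the same values on every run, which makes it easier to compare their statistical behaviour or time their set-up cost on your own hardware.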

    I see that JackOLantern stated in a comment that:

    I would say it is not recommendable to call curand_init and curand_uniform_double from within the same kernel for two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.

    I dealt with this over several pages of my thesis. I tried various approaches to get different random numbers in each thread, and creating a curandState in each thread turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and, among others, I tried:

    • developing my own simple random number generator (a Linear Congruential Generator), whose initialization was basically free; however, performance suffered greatly when generating numbers, so in the end having a curandState in each thread turned out to be superior,
    • pre-allocating curandStates and reusing them. This was memory heavy, and when I decreased the number of preallocated states I had to use non-zero values for the offset/subsequence parameters of curand_init in order to get rid of bias, which decreased performance when generating numbers.
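For comparison, the pre-allocation approach from the second bullet commonly looks something like the sketch below. The names `setupStates`, `meanOfDraws` and `runPreallocated` are hypothetical, and this is not the code from the thesis. It shows where the memory cost comes from: one `curandState` per thread is kept in global memory between kernel launches.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// One-off set-up: one state per thread, stored in global memory.
__global__ void setupStates(curandState *states, unsigned long long seed)
{
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, tId, 0, &states[tId]);  // same seed, subsequence = thread index
}

// Later kernels reuse the stored states instead of calling curand_init again.
__global__ void meanOfDraws(curandState *states, double *d_out, int draws)
{
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    curandState local = states[tId];          // work on a local copy
    double sum = 0.0;
    for (int k = 0; k < draws; ++k)
        sum += curand_uniform_double(&local);
    d_out[tId] = sum / draws;
    states[tId] = local;                      // save so the next launch continues the sequence
}

// Hypothetical host helper: returns the mean of `draws` uniform samples for each thread.
std::vector<double> runPreallocated(int blocks, int threads, int draws, unsigned long long seed)
{
    int n = blocks * threads;
    curandState *d_states = nullptr;
    double *d_out = nullptr;
    cudaMalloc((void**)&d_states, n * sizeof(curandState));  // memory cost: one state per thread
    cudaMalloc((void**)&d_out, n * sizeof(double));

    setupStates<<<blocks, threads>>>(d_states, seed);
    meanOfDraws<<<blocks, threads>>>(d_states, d_out, draws);

    std::vector<double> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_states);
    return h_out;
}

int main()
{
    std::vector<double> means = runPreallocated(32, 128, 10, 1234ULL);
    printf("states occupy %zu bytes, first mean = %f\n",
           means.size() * sizeof(curandState), means[0]);
    return 0;
}
```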

    So after a thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was the number of registers these states occupied, so I had to be careful with block sizes so as not to exceed the maximum number of registers available to each block.
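One way to check the register budget is to ask the runtime how many registers the compiled kernel uses per thread and compare that with the per-block limit of the device. The sketch below does this with `cudaFuncGetAttributes` and `cudaGetDeviceProperties`. The kernel `perThreadState` and the helper `maxThreadsByRegisters` are hypothetical, and the calculation is only a rough estimate because it ignores the hardware's register allocation granularity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Stand-in kernel that keeps a curandState per thread, as in the answer.
__global__ void perThreadState(double *d_out)
{
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    curandState state;
    curand_init((unsigned long long)clock() + tId, 0, 0, &state);
    d_out[tId] = curand_uniform_double(&state);
}

// Rough upper bound on the block size imposed by registers alone,
// rounded down to a whole number of warps.
int maxThreadsByRegisters(int regsPerThread, int regsPerBlock, int warpSize)
{
    if (regsPerThread <= 0) return 0;
    int threads = regsPerBlock / regsPerThread;
    return (threads / warpSize) * warpSize;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, perThreadState);  // numRegs = registers per thread

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // device 0 is assumed

    int limit = maxThreadsByRegisters(attr.numRegs, prop.regsPerBlock, prop.warpSize);
    if (limit > prop.maxThreadsPerBlock) limit = prop.maxThreadsPerBlock;

    printf("kernel uses %d registers per thread; block size should not exceed %d\n",
           attr.numRegs, limit);
    return 0;
}
```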

    That's what I have to say about the provided solution, which I was finally able to test, and it works just fine on my machine/GPU. I ran the code from the UPDATE section of the question above and 16 different random numbers were displayed in the console correctly. Therefore I advise you to perform proper error checking after executing the kernel to see what went wrong inside. This topic is covered very well in this SO thread.
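A minimal sketch of such error checking, applied to the kernel from the UPDATE section, could look like this. The helper name `launchAndCheck` is hypothetical, and this is only one common pattern. A kernel launch returns no status itself, so the code calls `cudaGetLastError` right after the launch to catch configuration errors, then `cudaDeviceSynchronize` to catch errors raised while the kernel was running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void cudaRand(double *d_out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init((unsigned long long)clock() + i, 0, 0, &state);
    d_out[i] = curand_uniform_double(&state);
}

// Hypothetical helper: launches one block and reports the first error encountered.
cudaError_t launchAndCheck(int threadsPerBlock)
{
    double *d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_out, threadsPerBlock * sizeof(double));
    if (err != cudaSuccess) return err;

    cudaRand<<<1, threadsPerBlock>>>(d_out);
    err = cudaGetLastError();                               // e.g. an invalid launch configuration
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // errors raised during execution

    cudaFree(d_out);
    return err;
}

int main()
{
    // 16 threads is a valid block size; 4096 exceeds the per-block limit and should fail.
    printf("16 threads:   %s\n", cudaGetErrorString(launchAndCheck(16)));
    printf("4096 threads: %s\n", cudaGetErrorString(launchAndCheck(4096)));
    return 0;
}
```

Without these two checks, a failed launch passes silently and the later `cudaMemcpy` simply copies back whatever was in the buffer.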
