Minimize the number of cuRAND states stored during a MC simulation

Problem Description

I am currently writing a Monte-Carlo simulation in CUDA. As such, I need to generate lots of random numbers on the fly using the cuRAND library. Each thread processes one element in a huge float array (omitted in the example) and generates 1 or 2 random numbers per kernel invocation.

The usual approach (see example below) seems to be to allocate one state per thread. The states are reused in subsequent kernel invocations. However, this does not scale well when the number of threads increases (up to 10⁸ in my application), since the state array becomes the dominant memory (and bandwidth) consumer in the program (with the default XORWOW generator a curandState is about 48 bytes, so 10⁸ states would occupy roughly 4.8 GB).

I know that one possible approach would be to process multiple array elements per thread in a grid-stride loop fashion. Here I want to investigate a different method.

I am aware that for the compute capability I am targeting (3.5), the maximum number of resident threads per SM is 2048, i.e. 2 blocks in the example below.
Is it possible to use only 2048 states per multiprocessor, regardless of the total number of threads? All random numbers generated should still be statistically independent.

I think this might be done if each resident thread were associated with a unique index modulo 2048, which could then be used to fetch the state from an array. Does such an index exist?

More generally, are there other ways to reduce the memory footprint of the RNG states ?

Example code (one state per thread)

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <assert.h>

#define gridSize 100
#define blockSize 1024
#define iter 100

__global__ void rng_init(unsigned long long seed, curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, Idx, 0, &states[Idx]);
}

__global__ void kernel(curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  const float x = curand_uniform(&states[Idx]);
  // Do stuff with x ...
}

int main() {
  curandState * states;
  assert(cudaMalloc(&states, gridSize*blockSize*sizeof(curandState)) == cudaSuccess);
  rng_init<<<gridSize,blockSize>>>(clock(), states);
  assert(cudaDeviceSynchronize() == cudaSuccess);

  for (size_t i = 0 ; i < iter ; ++i) {
    kernel<<<gridSize,blockSize>>>(states);
    assert(cudaDeviceSynchronize() == cudaSuccess);
  }
  return 0;
}

Solution

In short, in your question, you mention using a grid-striding loop as a way to collapse the required state. I think this method, or one like it (as suggested by @talonmies) is the most sensible approach. Choose a method that reduces the thread count to the minimum necessary to keep the machine busy/fully utilized. Then have each thread compute multiple operations, re-using the provided random generator state.
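To make that concrete, here is a minimal sketch of the grid-stride pattern (an illustration only, not code from the benchmark below); N, data, and the per-element work are placeholders:

__global__ void kernel_gridstride(curandState * states, float * data, size_t N) {
  const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  curandState local = states[tid];                 // work on a register copy
  for (size_t i = tid; i < N; i += (size_t)gridDim.x * blockDim.x) {
    const float x = curand_uniform(&local);        // one draw per element
    data[i] = x;                                   // placeholder for real work
  }
  states[tid] = local;                             // persist for the next launch
}

Launched with a grid just large enough to fill the device, this needs only gridDim.x * blockDim.x states no matter how large the array is. Working on a register copy of the state and writing it back once per launch also addresses the bandwidth concern, since the global state array is touched only twice per thread per kernel invocation.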

I started with your code shell and turned it into a classical "hello world" MC problem: estimate pi from the ratio of the area of an inscribed circle to the area of its bounding square (pi/4), by generating random points in the square and counting the fraction that land inside the circle, so that pi ≈ 4 × hits / total points.

Then we will consider 3 methods:

  1. Create a large 1D grid, as well as state for each thread, so that each thread computes one random point and tests it.

  2. Create a much smaller 1D grid, as well as state for each thread, and allow each thread to compute multiple random points/tests, so that the same number of points/tests are generated as in case 1.

  3. Create a grid the same size as method 2, but also create a "unique resident thread index", and then provide enough state to cover just the unique resident threads. Each thread will effectively use the state selected by the "resident thread index", rather than an ordinary globally unique thread index. Compute the same number of points/tests as in case 1.

The computation of the "unique resident thread index" is not a trivial matter. In host code:

  1. We must determine the maximum number of blocks that can theoretically be resident. I have used a simple heuristic for this, but there are arguably better methods (one is sketched just after this list). I simply divide the maximum resident threads per multiprocessor by the chosen number of threads per block, using integer division.

  2. Then initialize enough state to cover the maximum number of blocks times the number of SMs on the GPU.
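As a hedged aside (not part of the benchmark code below), the runtime occupancy API can replace the hand-rolled heuristic from step 1; it also accounts for the register and shared-memory limits of the specific kernel rather than just the thread-count limit:

// Sketch: let the runtime bound the resident blocks per SM for kernel2.
int blocks_per_sm = 0, num_sms = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel2, blockSize, 0);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
int total_state = blocks_per_sm * num_sms * blockSize;  // states worth initializing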

In device code:

  1. Determine which SM I am on.
  2. Using that SM, perform an atomic test to retrieve a unique block number per SM.
  3. Combine the results of 1 and 2, plus the threadIdx.x variable, to create a "unique resident thread index" (written out just below). This then becomes our thread index in place of the usual calculation.
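Putting the three pieces together, the index computed in kernel2 below has the following shape (spelled out here for clarity; max_blocks_per_sm is the host-side bound from step 1 of the previous list):

  resident_idx = (smid * max_blocks_per_sm + per_sm_block_id) * blockDim.x + threadIdx.x

This value is unique among all threads that can be resident at the same time, and its maximum stays below the number of states initialized.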

From a results perspective, the following observations can be made:

  1. Contrary to what is stated in the other answer, the init time is not insignificant. It should not be ignored. For this simple problem, the init kernel run time outweighs the calculation kernel run time. Therefore our most important takeaway is that we should seek methods to minimize the creation of random generator state. We certainly do not want to needlessly re-run the initialization. Therefore we should probably discard method 1 based on these results, for this particular code/test. (A cheaper-to-seed generator option is noted just after this list.)

  2. From a calculation kernel runtime perspective, there is little to commend one kernel over another.

  3. From a code complexity perspective, method 2 is clearly less complex than method 3, while giving about the same performance.
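One further option, offered as a hedged aside rather than something benchmarked in the listing below: cuRAND's counter-based Philox generator (curandStatePhilox4_32_10_t) is far cheaper to seed than the default XORWOW state, cheap enough that a kernel can recreate its state on the fly and store nothing between launches:

__global__ void kernel_philox(unsigned long long seed, unsigned long long offset) {
  const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t st;            // lives in registers, never stored
  curand_init(seed, tid, offset, &st);      // cheap for a counter-based generator
  const float x = curand_uniform(&st);
  // ... use x; advance `offset` between launches so draws never overlap ...
}

This trades the state array (and its init kernel) entirely for a small per-launch seeding cost, at the price of switching generators.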

For this particular test case, then, method 2 seems to be the winner, exactly as predicted by @talonmies.

What follows is a worked example of the 3 methods. I'm not claiming that this is defect free or instructive for every case, code, or scenario. There are a lot of moving parts here, but I believe the 3 conclusions above are valid for this case.

$ cat t1201.cu
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <assert.h>
#include <iostream>
#include <stdlib.h>

#define blockSize 1024

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

#define MAX_SM 64  // upper bound on SM count; sizes the per-SM block-id table

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


__device__ unsigned long long count = 0;
__device__ unsigned int blk_ids[MAX_SM] = {0};

__global__ void rng_init(unsigned long long seed, curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, Idx, 0, &states[Idx]);
}

__global__ void kernel(curandState * states, int length) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = 0; i < length; i++){
    const float x = curand_uniform(&states[Idx]);
    const float y = curand_uniform(&states[Idx]);
    if (sqrtf(x*x+y*y)<1.0)  // point falls inside the quarter circle
      atomicAdd(&count, 1ULL);}
}

static __device__ __inline__ int __mysmid(){
  // Read the id of the SM this thread is running on via the PTX %smid register.
  int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}

__device__ int get_my_resident_thread_id(int sm_blk_id, int max_blk){
  // Unique among all concurrently resident threads: distinct SMs, distinct
  // resident blocks on an SM (sm_blk_id < max_blk), distinct threads in a block.
  return (__mysmid()*max_blk + sm_blk_id)*blockDim.x + threadIdx.x;
}

__device__ int get_block_id(){
  // Claim the lowest free bit in this SM's occupancy mask with atomicOr;
  // the bit position becomes a block id that is unique per SM while resident.
  int my_sm = __mysmid();
  int my_block_id = -1;
  bool done = false;
  int i = 0;
  while ((!done)&&(i<32)){
    unsigned int block_flag = 1<<i;
    if ((atomicOr(blk_ids+my_sm, block_flag)&block_flag) == 0){my_block_id = i; done = true;}
    i++;}
  return my_block_id;
}

__device__ void release_block_id(int block_id){
  // Return the bit to the pool so a later block on this SM can reuse the id.
  unsigned int block_mask = ~(1<<block_id);
  int my_sm = __mysmid();
  atomicAnd(blk_ids+my_sm, block_mask);
}

__global__ void kernel2(curandState * states, int length, int max_blk) {

  __shared__ volatile int my_block_id;
  if (!threadIdx.x) my_block_id = get_block_id();  // one claim per block
  __syncthreads();
  const size_t Idx = get_my_resident_thread_id(my_block_id, max_blk);
  for (int i = 0; i < length; i++){
    const float x = curand_uniform(&states[Idx]);
    const float y = curand_uniform(&states[Idx]);
    if (sqrtf(x*x+y*y)<1.0)
      atomicAdd(&count, 1ULL);}
  __syncthreads();
  if (!threadIdx.x) release_block_id(my_block_id);
  __syncthreads();
}



int main(int argc, char *argv[]) {
  int gridSize = 10;
  if (argc > 1) gridSize = atoi(argv[1]);
  curandState * states;
  assert(cudaMalloc(&states, gridSize*gridSize*blockSize*sizeof(curandState)) == cudaSuccess);
  unsigned long long hcount;
  // warm-up
  rng_init<<<gridSize*gridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  //method 1: 1 curand state per point
  std::cout << "Method 1 init blocks: " << gridSize*gridSize << std::endl;
  unsigned long long dtime = dtime_usec(0);
  rng_init<<<gridSize*gridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  unsigned long long initt = dtime_usec(dtime);
  kernel<<<gridSize*gridSize,blockSize>>>(states, 1);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  dtime = dtime_usec(dtime) - initt;
  cudaMemcpyFromSymbol(&hcount, count, sizeof(unsigned long long));
  std::cout << "method 1 elapsed time: " << dtime/(float)USECPSEC << " init time: " << initt/(float)USECPSEC << " pi: " << 4.0f*hcount/(float)(gridSize*gridSize*blockSize) << std::endl;
  hcount = 0;
  cudaMemcpyToSymbol(count, &hcount, sizeof(unsigned long long));
  //method 2: 1 curand state per gridSize points
  std::cout << "Method 2 init blocks: " << gridSize << std::endl;
  dtime = dtime_usec(0);
  rng_init<<<gridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  initt = dtime_usec(dtime);
  kernel<<<gridSize,blockSize>>>(states, gridSize);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  dtime = dtime_usec(dtime) - initt;
  cudaMemcpyFromSymbol(&hcount, count, sizeof(unsigned long long));
  std::cout << "method 2 elapsed time: " << dtime/(float)USECPSEC << " init time: " << initt/(float)USECPSEC << " pi: " << 4.0f*hcount/(float)(gridSize*gridSize*blockSize) << std::endl;
  hcount = 0;
  cudaMemcpyToSymbol(count, &hcount, sizeof(unsigned long long));
  //method 3: 1 curand state per resident thread
  // compute the maximum number of state entries needed
  int num_sms;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
  int max_sm_threads;
  cudaDeviceGetAttribute(&max_sm_threads, cudaDevAttrMaxThreadsPerMultiProcessor, 0);
  int max_blocks = max_sm_threads/blockSize;
  int total_state = max_blocks*num_sms*blockSize;
  int rgridSize = (total_state + blockSize-1)/blockSize;
  std::cout << "Method 3 sms: " << num_sms << " init blocks: " << rgridSize << std::endl;
  // run test
  dtime = dtime_usec(0);
  rng_init<<<rgridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  initt = dtime_usec(dtime);
  kernel2<<<gridSize,blockSize>>>(states, gridSize, max_blocks);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  dtime = dtime_usec(dtime) - initt;
  cudaMemcpyFromSymbol(&hcount, count, sizeof(unsigned long long));
  std::cout << "method 3 elapsed time: " << dtime/(float)USECPSEC << " init time: " << initt/(float)USECPSEC << " pi: " << 4.0f*hcount/(float)(gridSize*gridSize*blockSize) << std::endl;
  hcount = 0;
  cudaMemcpyToSymbol(count, &hcount, sizeof(unsigned long long));
  return 0;
}

$ nvcc -arch=sm_35 -O3 -o t1201 t1201.cu
$ ./t1201 28
Method 1 init blocks: 784
method 1 elapsed time: 0.001218 init time: 3.91075 pi: 3.14019
Method 2 init blocks: 28
method 2 elapsed time: 0.00117 init time: 0.003629 pi: 3.14013
Method 3 sms: 14 init blocks: 28
method 3 elapsed time: 0.001193 init time: 0.003622 pi: 3.1407
$

For method 3, an additional level of description of the concept is given here
