Lost in CUDA device pointers


Question

As part of my thesis work I am working on a CUDA project (modifying somebody else's code, adding functionality, etc.). Being new to CUDA, this is turning out to be a real challenge for me. I am working with compute capability 1.3 cards, 4 x Tesla C1060, and sadly I am hitting some limitations of the platform.

I need to pass a couple of new structures to the device, which I believe are being copied correctly. But when I try to pass the pointers to those device structures in my kernel call, I reach the 256-byte limit on kernel arguments (as addressed in this question).

My code is as follows:

// main.cu
static void RunGPU(HostThreadState *hstate)
{
  SimState *HostMem = &(hstate->host_sim_state);
  SimState DeviceMem;

  TetrahedronStructGPU *h_root = &(hstate->root);
  TetrahedronStructGPU *d_root;
  TriangleFacesGPU *h_faces = &(hstate->faces);
  TriangleFacesGPU *d_faces;

  GPUThreadStates tstates;

  unsigned int n_threads = hstate->n_tblks * NUM_THREADS_PER_BLOCK;
  unsigned int n_tetras  = hstate->n_tetras; // 9600
  unsigned int n_faces   = hstate->n_faces;  // 38400

  InitGPUStates(HostMem, h_root, h_faces, &DeviceMem, &tstates, hstate->sim, 
                d_root, d_faces, n_threads, n_tetras, n_faces );
  cudaThreadSynchronize();

  ...

  kernel<<<dimGrid, dimBlock, k_smem_sz>>>(DeviceMem, tstates, /*OK, these 2*/
                                           d_root, d_faces);
                           // Limit of 256 bytes adding d_root and/or d_faces
  cudaThreadSynchronize();

  ...

}

The InitGPUStates function is in another source file:

// kernel.cu
int InitGPUStates(SimState* HostMem, TetrahedronStructGPU* h_root,
                  TriangleFacesGPU* h_faces,
                  SimState* DeviceMem, GPUThreadStates *tstates,
                  SimulationStruct* sim, 
                  TetrahedronStructGPU* d_root, TriangleFacesGPU* d_faces,
                  int n_threads, int n_tetras, int n_faces)
{
  unsigned int size;

  // Allocate and copy RootTetrahedron (d_root) on device
  size = n_tetras * sizeof(TetrahedronStructGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_root, size));
  checkCudaErrors(cudaMemcpy(d_root, h_root, size, cudaMemcpyHostToDevice));

  // Allocate and copy Faces (d_faces) on device
  size = n_faces * sizeof(TriangleFacesGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_faces, size));
  checkCudaErrors(cudaMemcpy(d_faces, h_faces, size, cudaMemcpyHostToDevice));     

  ...
}

I understand that I need to pass only pointers to the locations in device memory. How can I get the address on the device? Is this passing of pointers done correctly?
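One thing visible in the code above is that d_root and d_faces are handed to InitGPUStates by value, so the cudaMalloc calls inside it write the new device addresses into local copies and the pointers in RunGPU are never set. The sketch below illustrates that pitfall and one common way around it, passing the address of the pointer. Item, alloc_by_value and alloc_by_address are hypothetical names used only for this illustration, not part of the original project.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-in for TetrahedronStructGPU / TriangleFacesGPU.
struct Item { int idx; float w; };

// Pitfall: d_items is a copy of the caller's pointer. cudaMalloc stores the
// device address in that copy, so the caller's pointer is left unchanged.
cudaError_t alloc_by_value(Item *d_items, size_t n) {
  cudaError_t err = cudaMalloc((void**)&d_items, n * sizeof(Item));
  if (err == cudaSuccess) cudaFree(d_items);  // release it; nobody else can
  return err;
}

// One possible fix: take the address of the caller's pointer, so cudaMalloc
// writes the device address where the caller can read it afterwards.
cudaError_t alloc_by_address(Item **d_items, const Item *h_items, size_t n) {
  cudaError_t err = cudaMalloc((void**)d_items, n * sizeof(Item));
  if (err != cudaSuccess) return err;
  err = cudaMemcpy(*d_items, h_items, n * sizeof(Item), cudaMemcpyHostToDevice);
  if (err != cudaSuccess) { cudaFree(*d_items); *d_items = NULL; }
  return err;
}
```

Applied to the code in the question, this would mean declaring the last two pointer parameters of InitGPUStates as TetrahedronStructGPU** and TriangleFacesGPU** and calling it with &d_root and &d_faces. That is a suggestion under the assumptions above, not a tested patch.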

The two new structures are:

// header.h
typedef struct {
  int idx;
  int vertices[4];
  float Nx, Ny, Nz, d;
} TriangleFacesGPU;

typedef struct {
  int idx, region;
  int vertices[4], faces[4], adjTetras[4];
  float n, mua, mus, g;
} TetrahedronStructGPU;

// other structures
typedef struct {
  BOOLEAN *is_active;
  BOOLEAN *dead;
  BOOLEAN *FstBackReflectionFlag;
  int *NextTetrahedron;
  UINT32 *NumForwardScatters;
  UINT32 *NumBackwardScatters;
  UINT32 *NumBackwardsSpecularReflections;
  UINT32 *NumBiases;
  UINT32 *p_layer;
  GFLOAT *p_x, *p_y, *p_z;
  GFLOAT *p_ux, *p_uy, *p_uz;
  GFLOAT *p_w;
  GFLOAT *Rspecular;
  GFLOAT *LocationFstBias;
  GFLOAT *OpticalPath;
  GFLOAT *MaxDepth;
  GFLOAT *MaxLikelihoodRatioIncrease;
  GFLOAT *LikelihoodRatioIncreaseFstBias;
  GFLOAT *LikelihoodRatio;
  GFLOAT *LikelihoodRatioAfterFstBias;
  GFLOAT *s, *sleft;
  TetrahedronStructGPU *tetrahedron;
  TriangleFacesGPU *faces;
} GPUThreadStates;

typedef struct {
  UINT32 *n_p_left;
  UINT64 *x;
  UINT32 *a;
  UINT64 *Rd_ra;
  UINT64 *A_rz;
  UINT64 *Tt_ra;
} SimState;

The definition of kernel is:

__global__ void kernel(SimState d_state, GPUThreadStates tstates,
                       TetrahedronStructGPU *d_root,
                       TriangleFacesGPU *d_faces);
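To see why this signature is heavy on kernel parameter space, it can help to add up the argument sizes. The sketch below does that for stand-in types that only reproduce the pointer counts of SimState (6 pointers) and GPUThreadStates (28 pointers). StateLike, ThreadsLike and the helper functions are hypothetical. The actual totals depend on the pointer width of the build target, so no specific byte count is claimed here.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-ins: only the number of pointer members matches.
struct StateLike   { void *p[6];  };  // SimState holds 6 pointers
struct ThreadsLike { void *p[28]; };  // GPUThreadStates holds 28 pointers

// Structs passed by value are copied whole into kernel parameter space.
__global__ void by_value(StateLike s, ThreadsLike t, void *root, void *faces) {}

// Passing pointers to the structs costs one pointer per argument.
__global__ void by_pointer(StateLike *s, ThreadsLike *t, void *root, void *faces) {}

// Bytes of arguments for each signature (ignoring alignment padding).
static size_t by_value_bytes() {
  return sizeof(StateLike) + sizeof(ThreadsLike) + 2 * sizeof(void*);
}
static size_t by_pointer_bytes() {
  return 4 * sizeof(void*);
}
```

Comparing by_value_bytes() with the 256-byte limit mentioned above for the chosen build target shows how little headroom the by-value signature leaves, while by_pointer_bytes() stays small regardless of how many members the structs gain.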

I will work on changing SimState d_state to a pointer, SimState *d_state, and likewise GPUThreadStates tstates to GPUThreadStates *tstates.

Recommended answer

Finally, I solved the 256-byte issue. But I am still really lost in pointers.

My modified code is as follows:

// main.cu
static void RunGPU(HostThreadState *hstate)
{
  SimState *HostMem = &(hstate->host_sim_state);

  // new pointers to pass
  SimState *DeviceMem = (SimState*)malloc(sizeof(SimState));
  GPUThreadStates *tstates = (GPUThreadStates*)malloc(sizeof(GPUThreadStates));

  TetrahedronStructGPU *h_root = hstate->root; //root, pointer in HostThreadState
  TetrahedronStructGPU *d_root;
  TriangleFacesGPU *h_faces = hstate->faces; //faces, pointer in HostThreadState
  TriangleFacesGPU *d_faces;

  unsigned int n_threads = hstate->n_tblks * NUM_THREADS_PER_BLOCK;
  unsigned int n_tetras  = hstate->n_tetras; // 9600
  unsigned int n_faces   = hstate->n_faces;  // 38400

  InitGPUStates(HostMem, h_root, h_faces, DeviceMem, tstates, hstate->sim, 
                d_root, d_faces, n_threads, n_tetras, n_faces );
  cudaThreadSynchronize();

  ...

  kernel<<<dimGrid, dimBlock, k_smem_sz>>>(DeviceMem, tstates,
                                           d_root, d_faces);
                                         // No limit reached!
  cudaThreadSynchronize();

  ...      
}

In the InitGPUStates function the changes are as follows. Pay special attention to the copy of DeviceMem (I tried many forms without success). Some forms (with parentheses, like cudaMalloc((void **)&(*DeviceMem).n_p_left, size)) do not give me any error. I am assuming that no error means no data is copied to the device. In the current form the error is code=11(cudaErrorInvalidValue) "cudaMalloc((void**)&DeviceMem->n_photons_left, size)".

// kernel.cu
int InitGPUStates(SimState* HostMem, TetrahedronStructGPU* h_root,
                  TriangleFacesGPU* h_faces,
                  SimState* DeviceMem, GPUThreadStates *tstates,
                  SimulationStruct* sim, 
                  TetrahedronStructGPU* d_root, TriangleFacesGPU* d_faces,
                  int n_threads, int n_tetras, int n_faces)
{
  unsigned int size;

  // Allocate and copy RootTetrahedron (d_root) on device
  size = n_tetras * sizeof(TetrahedronStructGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_root, size));
  checkCudaErrors(cudaMemcpy(d_root, h_root, size, cudaMemcpyHostToDevice));

  // Allocate and copy Faces (d_faces) on device
  size = n_faces * sizeof(TriangleFacesGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_faces, size));
  checkCudaErrors(cudaMemcpy(d_faces, h_faces, size, cudaMemcpyHostToDevice));     

  // HELP NEEDED MAINLY FROM HERE REGARDING POINTER VALUE COPY!
  checkCudaErrors( cudaMalloc((void**)&DeviceMem, sizeof(SimState)) ); //Needed?

  size = sizeof(UINT32);
  checkCudaErrors( cudaMalloc(&DeviceMem->n_p_left, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->n_p_left,
                   HostMem->n_p_left, size, cudaMemcpyHostToDevice) );

  size = n_threads * sizeof(UINT32);
  checkCudaErrors( cudaMalloc(&DeviceMem->a, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->a, HostMem->a, size,
                                      cudaMemcpyHostToDevice) );
  size = n_threads * sizeof(UINT64);
  checkCudaErrors( cudaMalloc(&DeviceMem->x, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->x, HostMem->x, size,
                                      cudaMemcpyHostToDevice) );
  ...
}
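A likely reason for the cudaErrorInvalidValue above is that the first cudaMalloc replaces the host pointer DeviceMem with a device address, so &DeviceMem->n_p_left is then a location in device memory that cudaMalloc cannot write to from the host. A pattern that is often used instead is to keep a host-side copy of the struct, fill its members with device addresses, and then copy the struct itself to the device. The sketch below shows that pattern under stated assumptions: MiniState, bump, make_state and run_once are hypothetical names, MiniState is a two-member stand-in for SimState, and cudaDeviceSynchronize assumes CUDA 4.0 or later.

```cuda
#include <cuda_runtime.h>

// Hypothetical two-member stand-in for SimState.
struct MiniState {
  unsigned int *n_left;       // one counter in device memory
  unsigned long long *x;      // one value per thread in device memory
};

// Device code may dereference s because s points to device memory.
__global__ void bump(MiniState *s) {
  s->x[threadIdx.x] = threadIdx.x;
  if (threadIdx.x == 0) *s->n_left += 1;
}

// Fills *mirror (host memory) with device addresses, then copies the struct
// itself to the device and returns that device copy through *d_state.
cudaError_t make_state(MiniState *mirror, MiniState **d_state,
                       unsigned int n_left, int n_threads) {
  cudaError_t err;
  // &mirror->n_left is a host address, so cudaMalloc can store into it.
  err = cudaMalloc((void**)&mirror->n_left, sizeof(unsigned int));
  if (err != cudaSuccess) return err;
  err = cudaMemcpy(mirror->n_left, &n_left, sizeof(unsigned int),
                   cudaMemcpyHostToDevice);
  if (err != cudaSuccess) return err;
  err = cudaMalloc((void**)&mirror->x, n_threads * sizeof(unsigned long long));
  if (err != cudaSuccess) return err;
  err = cudaMalloc((void**)d_state, sizeof(MiniState));
  if (err != cudaSuccess) return err;
  // Copy the struct of device addresses so the kernel can read its members.
  return cudaMemcpy(*d_state, mirror, sizeof(MiniState), cudaMemcpyHostToDevice);
}

// Runs the kernel once and reads the counter back through the host mirror.
unsigned int run_once(unsigned int start, int n_threads) {
  MiniState mirror = { NULL, NULL };
  MiniState *d_state = NULL;
  unsigned int out = 0;
  if (make_state(&mirror, &d_state, start, n_threads) == cudaSuccess) {
    bump<<<1, n_threads>>>(d_state);
    cudaDeviceSynchronize();
    cudaMemcpy(&out, mirror.n_left, sizeof(out), cudaMemcpyDeviceToHost);
  }
  // cudaFree accepts a null pointer, so this also covers partial failure.
  cudaFree(mirror.n_left);
  cudaFree(mirror.x);
  cudaFree(d_state);
  return out;
}
```

The host mirror is kept around on purpose: it is the only place where the host can look up the member addresses again for later copies back and for cudaFree.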

I understand that I need to pass only pointers to the locations in device memory. How can I get the address on the device? Is this passing of pointers done correctly?

The two new structures are:

// header.h
typedef struct {
  int idx;
  int vertices[4];
  float Nx, Ny, Nz, d;
} TriangleFacesGPU;

typedef struct {
  int idx, region;
  int vertices[4], faces[4], adjTetras[4];
  float n, mua, mus, g;
} TetrahedronStructGPU;

// other structures
typedef struct {
  BOOLEAN *is_active;
  BOOLEAN *dead;
  BOOLEAN *FstBackReflectionFlag;
  int *NextTetrahedron;
  UINT32 *NumForwardScatters;
  UINT32 *NumBackwardScatters;
  UINT32 *NumBackwardsSpecularReflections;
  UINT32 *NumBiases;
  UINT32 *p_layer;
  GFLOAT *p_x, *p_y, *p_z;
  GFLOAT *p_ux, *p_uy, *p_uz;
  GFLOAT *p_w;
  GFLOAT *Rspecular;
  GFLOAT *LocationFstBias;
  GFLOAT *OpticalPath;
  GFLOAT *MaxDepth;
  GFLOAT *MaxLikelihoodRatioIncrease;
  GFLOAT *LikelihoodRatioIncreaseFstBias;
  GFLOAT *LikelihoodRatio;
  GFLOAT *LikelihoodRatioAfterFstBias;
  GFLOAT *s, *sleft;
  TetrahedronStructGPU *tetrahedron;
  TriangleFacesGPU *faces;
} GPUThreadStates;

typedef struct {
  UINT32 *n_p_left;
  UINT64 *x;
  UINT32 *a;
  UINT64 *Rd_ra;
  UINT64 *A_rz;
  UINT64 *Tt_ra;
} SimState;

The definition of kernel changes to:

__global__ void kernel(SimState *d_state, GPUThreadStates *tstates,
                       TetrahedronStructGPU *d_root,
                       TriangleFacesGPU *d_faces);
