cudaMalloc global array causes seg fault


Problem description

I ran into trouble when trying to access a global array from a function that executes on the device:

float globTemp[3][3] = "some value in here";
__device__ float* globTemp_d;

__global__ void compute(int *a, int w)
{
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int i = y*w+x;
  if(x<3 && y<3)
    a[i] = 1+globTemp_d[i];
}

int hostFunc(){
   float *a_d;
   cudaMalloc((void**)&a_d, 3*3*sizeof(int));
   cudaMalloc((void**)&globTemp_d, 3*3*sizeof(int));
   cudaMemcpy(globTemp_d,globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
   compute<<<1,1>>>(a_d,3);
   cudaMemcpy(a,a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);
}

However, I get a seg fault when I try to access globTemp_d[i]. Am I doing something wrong here?

Recommended answer

There are a variety of problems with your code:


  1. Your grid is a 1D grid of 1D threadblocks (in fact you are launching a single block of 1 thread), but your kernel is written as if it expected a 2D threadblock structure (it uses the .x and .y built-in variables). A single thread certainly won't get the work done, and a 1D threadblock won't work with your kernel code.
  2. __device__ variables are not accessed with cudaMalloc and cudaMemcpy. They require a different set of API calls, such as cudaMemcpyToSymbol.
  3. You're not doing any CUDA error checking, which is always recommended when you're having difficulty. You should check both API calls and kernel launches.
  4. You're mixing a float pointer (a_d) with an int pointer in the kernel parameters (int *a), so I don't think this code would compile without at least a warning. Ignoring that can of course lead to strange behavior.
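
As an illustration of point 3, here is a minimal sketch of one common way to check CUDA errors. The CUDA_CHECK macro, the fill_ones kernel, and the checked_fill wrapper are hypothetical names made up for this example and are not part of the original code. The sketch assumes a single block is enough, so n must not exceed the device's per-block thread limit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original answer): print a readable
// message and exit if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
              cudaGetErrorString(err_), __FILE__, __LINE__);          \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

// Trivial illustrative kernel: writes 1.0f to the first n elements.
__global__ void fill_ones(float *a, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n)
    a[i] = 1.0f;
}

// Hypothetical wrapper: every API call and the kernel launch are checked.
// Assumes n fits in one block (n <= 1024 on current devices).
void checked_fill(float *host_out, int n)
{
  float *a_d = nullptr;
  CUDA_CHECK(cudaMalloc((void**)&a_d, n * sizeof(float)));
  fill_ones<<<1, n>>>(a_d, n);
  CUDA_CHECK(cudaGetLastError());       // reports launch/configuration errors
  CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
  CUDA_CHECK(cudaMemcpy(host_out, a_d, n * sizeof(float), cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(a_d));
}
```

A kernel launch returns no status of its own, which is why the sketch queries cudaGetLastError right after the launch and then checks the result of cudaDeviceSynchronize.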

This is the closest I could come to your code while fixing all the errors:

#include <stdio.h>

__device__ float* globTemp_d;

__global__ void compute(float *a, int w)
{
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int i = (y*w)+x;
  if((x<3) && (y<3))
    a[i] = 1.0f+globTemp_d[i];
}

int main(){
   float *a_d, *d_globTemp;
   float globTemp[3][3] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f};
   float a[(3*3)];
   dim3 threads(3,3);
   dim3 blocks(1);
   cudaMalloc((void**)&a_d, 3*3*sizeof(float));
   cudaMalloc((void**)&d_globTemp, 3*3*sizeof(float));
   cudaMemcpy(d_globTemp,globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
   cudaMemcpyToSymbol(globTemp_d, &d_globTemp, sizeof(float *));
   compute<<<blocks,threads>>>(a_d,3);
   cudaMemcpy(a,a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);

   printf("results:\n");
   for (int i = 0; i<(3*3); i++)
     printf("a[%d] = %f\n", i, a[i]);
   return 0;
}

This code can be simplified by dispensing with the __device__ variable and just passing d_globTemp as a parameter to the kernel, using it in place of the references to globTemp_d. However, I did not make that simplification.
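
For reference, that simplification might look roughly like the sketch below. The extra kernel parameter temp and the run_compute wrapper are hypothetical names introduced for this example. The 3x3 size and the launch configuration are carried over from the code above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as above, except the input array arrives as an ordinary
// parameter (temp) instead of through a __device__ pointer variable.
__global__ void compute(float *a, const float *temp, int w)
{
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int i = (y * w) + x;
  if ((x < 3) && (y < 3))
    a[i] = 1.0f + temp[i];
}

// Hypothetical wrapper: copies 9 floats to the device, runs the kernel on a
// single 3x3 block, and copies the 9 results back into a.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t run_compute(const float *globTemp, float *a)
{
  const size_t bytes = 3 * 3 * sizeof(float);
  float *a_d = nullptr, *d_globTemp = nullptr;

  cudaError_t err = cudaMalloc((void**)&a_d, bytes);
  if (err == cudaSuccess)
    err = cudaMalloc((void**)&d_globTemp, bytes);
  if (err == cudaSuccess)
    err = cudaMemcpy(d_globTemp, globTemp, bytes, cudaMemcpyHostToDevice);
  if (err == cudaSuccess) {
    compute<<<dim3(1), dim3(3, 3)>>>(a_d, d_globTemp, 3);
    err = cudaGetLastError();  // launch errors
  }
  if (err == cudaSuccess)      // blocking copy, also surfaces execution errors
    err = cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost);

  cudaFree(a_d);               // cudaFree on a null pointer is a no-op
  cudaFree(d_globTemp);
  return err;
}
```

With this layout no cudaMemcpyToSymbol call is needed, because the pointer returned by cudaMalloc is handed to the kernel directly.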
