cudaMalloc global array causes seg fault
Problem description
I ran into a problem when trying to access a global array from a function that is executed on the device:
float globTemp[3][3] = "some value in here";
__device__ float* globTemp_d;

__global__ void compute(int *a, int w)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int i = y*w+x;
    if(x<3 && y<3)
        a[i] = 1+globTemp_d[i];
}

int hostFunc(){
    float *a_d;
    cudaMalloc((void**)&a_d, 3*3*sizeof(int));
    cudaMalloc((void**)&globTemp_d, 3*3*sizeof(int));
    cudaMemcpy(globTemp_d,globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(a_d,3);
    cudaMemcpy(a,a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);
}
However, I get a seg fault when I try to access globTemp_d[i]. Am I doing something wrong here?
Recommended answer
There are a variety of problems with your code:
- Your grid is a 1D grid of 1D threadblocks (in fact you are launching a single block of 1 thread), but your kernel is written as if it were expecting a 2D threadblock structure (using the .x and .y built-in variables). A single thread certainly won't get the work done, and a 1D threadblock won't work with your kernel code.
- __device__ variables are not accessed with cudaMalloc and cudaMemcpy. We use a different set of API calls, like cudaMemcpyToSymbol.
- You're not doing any CUDA error checking, which is always recommended when you're having difficulty. You should do CUDA error checking on both API calls and kernel calls.
- You're mixing float variables (a_d) with int variables in the kernel parameters (int *a), so I don't think this code would compile without at least a warning. And of course that can lead to strange behavior if you ignore it.
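The error-checking advice in the list above could be followed along these lines. This is only a sketch: the helper cudaOk, the macro CUDA_OK, the kernel addOne, and the wrapper runChecked are hypothetical names invented for illustration, not part of the original answer. It assumes n is no larger than the maximum block size, since it launches a single block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original answer): prints a failed
// runtime call with its source location and returns false.
inline bool cudaOk(cudaError_t err, const char *what, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed at %s:%d: %s\n",
                what, file, line, cudaGetErrorString(err));
        return false;
    }
    return true;
}
#define CUDA_OK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Illustrative kernel: adds 1 to each of the first n elements.
__global__ void addOne(float *a, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        a[i] += 1.0f;
}

// Returns true only if every API call and the kernel launch succeeded.
// Assumes n fits in one thread block.
bool runChecked(float *host, int n)
{
    float *dev = nullptr;
    size_t bytes = n * sizeof(float);
    if (!CUDA_OK(cudaMalloc((void **)&dev, bytes)))
        return false;
    bool ok = CUDA_OK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    if (ok) {
        addOne<<<1, n>>>(dev, n);
        ok = CUDA_OK(cudaGetLastError())        // errors from the launch itself
          && CUDA_OK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    }
    ok = ok && CUDA_OK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    cudaFree(dev);
    return ok;
}
```

Checking cudaGetLastError right after the launch and then cudaDeviceSynchronize separates a bad launch configuration from a fault inside the kernel, which is the kind of failure the question's code would likely report.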
This is the closest I could come to your code while fixing all the errors:
#include <stdio.h>

// Device-side pointer variable; the host sets it with cudaMemcpyToSymbol.
__device__ float* globTemp_d;

__global__ void compute(float *a, int w)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int i = (y*w)+x;
    if((x<3) && (y<3))
        a[i] = 1.0f+globTemp_d[i];
}

int main(){
    float *a_d, *d_globTemp;
    float globTemp[3][3] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f};
    float a[(3*3)];
    // One block of 3x3 threads, matching the 2D indexing in the kernel.
    dim3 threads(3,3);
    dim3 blocks(1);
    cudaMalloc((void**)&a_d, 3*3*sizeof(float));
    // Allocate and fill the buffer through an ordinary host-side pointer.
    cudaMalloc((void**)&d_globTemp, 3*3*sizeof(float));
    cudaMemcpy(d_globTemp,globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
    // Copy the pointer value itself into the __device__ variable.
    cudaMemcpyToSymbol(globTemp_d, &d_globTemp, sizeof(float *));
    compute<<<blocks,threads>>>(a_d,3);
    cudaMemcpy(a,a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);
    printf("results:\n");
    for (int i = 0; i<(3*3); i++)
        printf("a[%d] = %f\n", i, a[i]);
    return 0;
}
This code can be simplified by dispensing with the __device__ variable and just passing d_globTemp as a parameter to the kernel, and using it in place of references to globTemp_d. However, I did not make that simplification.
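For illustration, that simplification might look roughly like the sketch below. The kernel parameter name temp and the host wrapper computeOnDevice are assumptions introduced here, not names from the original answer; the wrapper replaces main only so that the result can be returned to a caller.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same indexing as the answer's kernel, but the input array arrives as a
// parameter (temp, a hypothetical name) instead of a __device__ pointer.
__global__ void compute(float *a, const float *temp, int w)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int i = (y * w) + x;
    if ((x < 3) && (y < 3))
        a[i] = 1.0f + temp[i];
}

// Hypothetical wrapper: copies 9 floats in, runs the kernel on one block of
// 3x3 threads, and copies 9 floats out. Returns false if any call fails.
bool computeOnDevice(const float *globTemp, float *a)
{
    const size_t bytes = 3 * 3 * sizeof(float);
    float *a_d = nullptr, *d_globTemp = nullptr;
    bool ok = cudaMalloc((void **)&a_d, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_globTemp, bytes) == cudaSuccess
           && cudaMemcpy(d_globTemp, globTemp, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 threads(3, 3);
        dim3 blocks(1);
        // No cudaMemcpyToSymbol needed: the device pointer is an argument.
        compute<<<blocks, threads>>>(a_d, d_globTemp, 3);
        ok = cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(a_d);        // freeing a null pointer is a no-op
    cudaFree(d_globTemp);
    return ok;
}
```

Passing the pointer as an argument removes the cudaMemcpyToSymbol step entirely, at the cost of one more kernel parameter.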