pycuda shared memory error "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value"
Problem description
I have a strange problem whose origin I cannot determine:
I have a working kernel for a special matrix-vector multiplication, which I want to speed up. Basically, the big matrix (10^6 by 10^6) is constructed from a few small matrices, so I want to put that data in shared memory. However, when I try to add the shared memory, I only get the error:
pycuda._driver.LogicError: cuLaunchKernel failed: invalid value
So my working kernel is:
#define FIELD_SIZE {field}
#define BLOCK_SIZE {block}
__global__ void MatrixMulKernel(double *gpu_matrix, double *gpu_b, double *gpu_y)
{
int tx = ... + threadIdx.x;
if(tx < FIELD_SIZE*FIELD_SIZE*BLOCK_SIZE)
{ ... multiplication ... }
}
If I try to add the shared memory part, it looks like this:
#define FIELD_SIZE {field}
#define BLOCK_SIZE {block}
__global__ void MatrixMulKernel(double *gpu_matrix_ptr, double *gpu_b, double *gpu_y)
{
__shared__ double gpu_matrix[BLOCK_SIZE*BLOCK_SIZE*13];
int tx = ... + threadIdx.x;
if(tx < BLOCK_SIZE*BLOCK_SIZE*13) { gpu_matrix[tx] = gpu_matrix_ptr[tx]; }
__syncthreads();
if(tx < FIELD_SIZE*FIELD_SIZE*BLOCK_SIZE)
{ ... multiplication ... }
}
This is the only part I changed, so the cause basically has to be the gpu_matrix[tx] = gpu_matrix_ptr[tx] statement, doesn't it? But I fail to see how that could be. I basically tried to copy the tiled matrix-multiplication example from the pycuda examples: http://wiki.tiker.net/PyCuda/Examples/MatrixmulTiled
The call is:
self.kernel.prepare([np.intp, np.intp, np.intp])
self.kernel.prepared_call(grid_shape,
block_shape,
self.matrix_gpu.gpudata,
b_gpu.gpudata,
y_gpu.gpudata)
where matrix_gpu, b_gpu and y_gpu are pycuda.gpuarray instances.
I hope you can clear up some of my confusion...
Recommended answer
According to your description, the shared memory you allocate is too big:
__shared__ double gpu_matrix[BLOCK_SIZE*BLOCK_SIZE*13];
Shared memory is one of the limited hardware resources of a CUDA GPU. The total size is about 48 KB per thread block, which you cannot increase. Each double takes 8 bytes, so the array above needs BLOCK_SIZE*BLOCK_SIZE*13*8 bytes.
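As a rough check, the footprint of the array above can be computed before launching anything. The sketch below is only arithmetic: it takes the 48 KB figure from this answer as the limit, and the BLOCK_SIZE values tried are hypothetical, since the question does not say which one was used.

```python
# Assumption: 48 KB of shared memory per thread block, as stated in the answer.
SHARED_LIMIT_BYTES = 48 * 1024
BYTES_PER_DOUBLE = 8


def shared_bytes(block_size, n_matrices=13):
    """Bytes used by `__shared__ double gpu_matrix[BLOCK_SIZE*BLOCK_SIZE*13]`."""
    return block_size * block_size * n_matrices * BYTES_PER_DOUBLE


def fits(block_size):
    """True if the array fits within the assumed per-block limit."""
    return shared_bytes(block_size) <= SHARED_LIMIT_BYTES


# Hypothetical BLOCK_SIZE values, for illustration only.
for bs in (8, 16, 21, 22, 32):
    print(bs, shared_bytes(bs), "fits" if fits(bs) else "too big")
```

Under these assumptions the array stops fitting at BLOCK_SIZE = 22, and BLOCK_SIZE = 32 would need more than twice the limit.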
CUDA actually provides a tool in the following directory to help you calculate the hardware resources you can use:
$CUDA_ROOT/tools/CUDA_Occupancy_Calculator.xls
On the other hand, the shared memory required by mat-vec-mul-like kernels can usually be reduced from O(BLOCK_SIZE^2) to O(BLOCK_SIZE). You may want to read the code of some successful mat-vec-mul kernels, such as those in MAGMA, before implementing your own.
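One common way to get O(BLOCK_SIZE) staging is to keep only a BLOCK_SIZE-long slice of the vector in shared memory while each thread accumulates one row's dot product. The following is a minimal CPU model of that idea in plain Python, not MAGMA's implementation and not the asker's kernel; the function name `tiled_matvec` and the dense row-major matrix layout are assumptions made for illustration.

```python
def tiled_matvec(a, x, block_size):
    """CPU model of a tiled mat-vec y = a @ x.

    Only a slice of x with at most block_size entries is staged at a time,
    mimicking a shared-memory buffer of size O(BLOCK_SIZE).
    """
    n_rows, n_cols = len(a), len(x)
    y = [0.0] * n_rows
    # Each strip of block_size rows plays the role of one thread block.
    for row0 in range(0, n_rows, block_size):
        row1 = min(row0 + block_size, n_rows)
        acc = [0.0] * (row1 - row0)  # one accumulator per "thread"
        for col0 in range(0, n_cols, block_size):
            # Stands in for the shared buffer: at most block_size values.
            x_tile = x[col0:col0 + block_size]
            for t in range(row1 - row0):
                row = a[row0 + t]
                for k, xk in enumerate(x_tile):
                    acc[t] += row[col0 + k] * xk
        y[row0:row1] = acc
    return y


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 1.0, 1.0]
print(tiled_matvec(a, x, block_size=2))
```

In a real kernel the inner copy into `x_tile` would be a cooperative load followed by `__syncthreads()`, and the matrix entries would be read directly from global memory rather than staged.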