CUDA: Mapping Error using CUSPARSE csrmv() routine
Question
I'm currently trying to use the CUSPARSE library in order to speed up an HPCG implementation. However, it appears I'm making some kind of mistake during device data allocation.
This is the code segment that results in CUSPARSE_STATUS_MAPPING_ERROR:
int HPC_sparsemv( CRS_Matrix *A_crs_d,
FP * x_d, FP * y_d)
{
FP alpha = 1.0f;
FP beta = 0.0f;
FP* vals = A_crs_d->vals;
int* inds = A_crs_d->col_ind;
int* row_ptr = A_crs_d->row_ptr;
/*generate Matrix descriptor for SparseMV computation*/
cusparseMatDescr_t matDescr;
cusparseCreateMatDescr(&matDescr);
cusparseStatus_t status;
/*hand off control to CUSPARSE routine*/
#ifdef DOUBLE
status = cusparseDcsrmv(cuspHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, A_crs_d->nrows,
A_crs_d->ncols,A_crs_d->nnz, &alpha, matDescr, vals, row_ptr,
inds, x_d, &beta, y_d);
#else
status = cusparseScsrmv(cuspHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, A_crs_d->nrows,
A_crs_d->ncols,A_crs_d->nnz, &alpha, matDescr, vals, row_ptr,
col_ind, x_d, &beta, y_d);
#endif
NOTE: FP is a typedef wrapped by conditional compilation guards, meaning it gets evaluated to be either a float or a double alias at compile-time.
And here is the function handling the data allocation:
int cudaAlloc(FP* r_d, FP* p_d, FP* Ap_d, FP* b_d, const FP* const b, FP * x_d, FP * const x,
struct CRS_Matrix* A_crs_d, int nrows, int ncols, int nnz){
std::cout << "Beginning device allocation..." << std::endl;
int size_r = nrows * sizeof(FP);
int size_c = ncols * sizeof(FP);
int size_nnz = nnz * sizeof(FP);
int allocStatus = 0;
/*device alloc r_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &r_d, size_r) );
/*device alloc p_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &p_d, size_c) );
/*device alloc Ap_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &Ap_d, size_r) );
/*device alloc b_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &b_d, size_r ) );
allocStatus |= (int) checkCuda( cudaMemcpy(b_d, b, size_r, cudaMemcpyHostToDevice));
/*device alloc x_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &x_d, size_r ) );
allocStatus |= (int) checkCuda( cudaMemcpy(x_d, x, size_r, cudaMemcpyHostToDevice));
/*device alloc A_crs_d*/
FP * valtmp;
allocStatus |= (int) checkCuda( cudaMalloc((void **) &valtmp, size_nnz) );
allocStatus |= (int) checkCuda( cudaMemcpy(valtmp, CRS->vals, size_nnz, cudaMemcpyHostToDevice) );
int * indtmp;
allocStatus |= (int) checkCuda( cudaMalloc((void **) &indtmp, nnz* sizeof(int)) );
allocStatus |= (int) checkCuda( cudaMemcpy(indtmp, CRS->col_ind,
nnz * sizeof(int) , cudaMemcpyHostToDevice) );
int * rowtmp;
allocStatus |= (int) checkCuda( cudaMalloc((void **) &rowtmp, (nrows + 1) * sizeof(int)) );
allocStatus |= (int) checkCuda( cudaMemcpy(rowtmp, CRS->row_ptr,
(nrows + 1) * sizeof(int), cudaMemcpyHostToDevice) );
allocStatus |= (int) checkCuda( cudaMallocHost( &A_crs_d, sizeof(CRS_Matrix)) );
A_crs_d->vals = valtmp;
A_crs_d->col_ind = indtmp;
A_crs_d->row_ptr = rowtmp;
A_crs_d->nrows = CRS->nrows;
A_crs_d->ncols = CRS->ncols;
A_crs_d->nnz = CRS->nnz;
std::cout << "Device allocation done." << std::endl;
return allocStatus;
}
During my first stop at StackOverflow I found this solved issue posted by somebody else: Cusparse status mapping error while using cuda constant memory
However, as I'm not using constant memory on the arguments passed to csrmv() that didn't solve my problem. I also checked data integrity and the CRS_Matrix on the device exactly matches the original in host memory.
I'm quite at a loss with this issue and couldn't find anything that would indicate a problem in the CUDA Toolkit Documentation, so any help would be greatly appreciated.
Thanks in advance.
Answer
There are some errors in the code you have shown.
1. It's not possible to pass a pointer parameter to a routine by value, perform a cudaMalloc operation on that pointer, and then expect the result to show up in the calling environment. You are doing this for the x_d, b_d, and A_crs_d (with cudaMallocHost) parameters that you are passing to cudaAlloc. One possible fix is to handle those parameters as double-pointer (**) parameters within the routine and pass the address of each pointer to the routine. This allows the modified pointer value to show up in the calling environment. This is really a question of proper C coding and is not specific to CUDA.
2. At least with respect to cudaAlloc, it appears that you intend to implement Ax=b. In that case, the length of the x vector is the number of columns of A, and the length of the b vector is the number of rows of A. In your cudaAlloc routine you are allocating both of these as the size of the rows of A, so this can't be correct. This also affects the sizes used in the subsequent cudaMemcpy operations.
3. It appears that the code you have shown was only tested for the double case, since the column index parameter you pass differs between the two calls (presumably one for float and one for double). In any event, I've built a complete code around what you have shown (for the double case), plus the above changes, and it runs without error and produces the correct result for me:
$ cat t1216.cu
#include <cusparse.h>
#include <iostream>
#define checkCuda(x) x
#ifdef USE_FLOAT
typedef float FP;
#else
#define DOUBLE
typedef double FP;
#endif
struct CRS_Matrix{
FP *vals;
int *col_ind;
int *row_ptr;
int ncols;
int nnz;
int nrows;
} *CRS;
cusparseHandle_t cuspHandle;
int HPC_sparsemv( CRS_Matrix *A_crs_d,
FP * x_d, FP * y_d)
{
FP alpha = 1.0f;
FP beta = 0.0f;
FP* vals = A_crs_d->vals;
int* inds = A_crs_d->col_ind;
int* row_ptr = A_crs_d->row_ptr;
/*generate Matrix descriptor for SparseMV computation*/
cusparseMatDescr_t matDescr;
cusparseCreateMatDescr(&matDescr);
cusparseStatus_t status;
/*hand off control to CUSPARSE routine*/
#ifdef DOUBLE
status = cusparseDcsrmv(cuspHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, A_crs_d->nrows,
A_crs_d->ncols,A_crs_d->nnz, &alpha, matDescr, vals, row_ptr,
inds, x_d, &beta, y_d);
#else
status = cusparseScsrmv(cuspHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, A_crs_d->nrows,
A_crs_d->ncols,A_crs_d->nnz, &alpha, matDescr, vals, row_ptr,
col_ind, x_d, &beta, y_d); // col_ind here should probably be inds
#endif
return (int)status;
}
int cudaAlloc(FP* r_d, FP* p_d, FP* Ap_d, FP** b_d, const FP* const b, FP ** x_d, FP * const x,
struct CRS_Matrix** A_crs_d, int nrows, int ncols, int nnz){
std::cout << "Beginning device allocation..." << std::endl;
int size_r = nrows * sizeof(FP);
int size_c = ncols * sizeof(FP);
int size_nnz = nnz * sizeof(FP);
int allocStatus = 0;
/*device alloc r_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &r_d, size_r) );
/*device alloc p_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &p_d, size_c) );
/*device alloc Ap_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) &Ap_d, size_r) );
/*device alloc b_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) b_d, size_r ) );
allocStatus |= (int) checkCuda( cudaMemcpy(*b_d, b, size_r, cudaMemcpyHostToDevice));
/*device alloc x_d*/
allocStatus |= (int) checkCuda( cudaMalloc((void **) x_d, size_c ) );
allocStatus |= (int) checkCuda( cudaMemcpy(*x_d, x, size_c, cudaMemcpyHostToDevice));
/*device alloc A_crs_d*/
FP * valtmp;
allocStatus |= (int) checkCuda( cudaMalloc((void **) &valtmp, size_nnz) );
allocStatus |= (int) checkCuda( cudaMemcpy(valtmp, CRS->vals, size_nnz, cudaMemcpyHostToDevice) );
int * indtmp;
allocStatus |= (int) checkCuda( cudaMalloc((void **) &indtmp, nnz* sizeof(int)) );
allocStatus |= (int) checkCuda( cudaMemcpy(indtmp, CRS->col_ind,
nnz * sizeof(int) , cudaMemcpyHostToDevice) );
int * rowtmp;
allocStatus |= (int) checkCuda( cudaMalloc((void **) &rowtmp, (nrows + 1) * sizeof(int)) );
allocStatus |= (int) checkCuda( cudaMemcpy(rowtmp, CRS->row_ptr,
(nrows + 1) * sizeof(int), cudaMemcpyHostToDevice) );
allocStatus |= (int) checkCuda( cudaMallocHost( A_crs_d, sizeof(CRS_Matrix)) );
(*A_crs_d)->vals = valtmp;
(*A_crs_d)->col_ind = indtmp;
(*A_crs_d)->row_ptr = rowtmp;
(*A_crs_d)->nrows = CRS->nrows;
(*A_crs_d)->ncols = CRS->ncols;
(*A_crs_d)->nnz = CRS->nnz;
std::cout << "Device allocation done." << std::endl;
return allocStatus;
}
int main(){
CRS = (struct CRS_Matrix *)malloc(sizeof(struct CRS_Matrix));
cusparseCreate(&cuspHandle);
// simple test matrix
#define M0_M 5
#define M0_N 5
FP m0_csr_vals[] = {2.0f, 1.0f, 1.0f, 2.0f, 1.0f, 1.0f, 2.0f, 1.0f, 1.0f, 2.0f, 1.0f, 1.0f, 2.0f};
int m0_col_idxs[] = { 0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4};
int m0_row_ptrs[] = { 0, 2, 5, 8, 11, 13};
FP m0_d[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
int m0_nnz = 13;
FP *r_d, *p_d, *Ap_d, *b_d, *x_d;
FP *b = new FP[M0_N];
CRS_Matrix *A_crs_d;
CRS->vals = m0_csr_vals;
CRS->col_ind = m0_col_idxs;
CRS->row_ptr = m0_row_ptrs;
CRS->nrows = M0_M;
CRS->ncols = M0_N;
CRS->nnz = m0_nnz;
// Ax = b
// r_d, p_d, Ap_d ??
int stat = cudaAlloc(r_d, p_d, Ap_d, &b_d, b, &x_d, m0_d, &A_crs_d, M0_M, M0_N, m0_nnz);
std::cout << "cudaAlloc status: " << stat << std::endl;
stat = HPC_sparsemv( A_crs_d, x_d, b_d);
std::cout << "HPC_sparsemv status: " << stat << std::endl;
FP *results = new FP[M0_M];
cudaMemcpy(results, b_d, M0_M*sizeof(FP), cudaMemcpyDeviceToHost);
std::cout << "Results:" << std::endl;
for (int i = 0; i < M0_M; i++) std::cout << results[i] << std::endl;
return 0;
}
$ nvcc -o t1216 t1216.cu -lcusparse
t1216.cu(153): warning: variable "r_d" is used before its value is set
t1216.cu(153): warning: variable "p_d" is used before its value is set
t1216.cu(153): warning: variable "Ap_d" is used before its value is set
t1216.cu(153): warning: variable "r_d" is used before its value is set
t1216.cu(153): warning: variable "p_d" is used before its value is set
t1216.cu(153): warning: variable "Ap_d" is used before its value is set
$ cuda-memcheck ./t1216
========= CUDA-MEMCHECK
Beginning device allocation...
Device allocation done.
cudaAlloc status: 0
HPC_sparsemv status: 0
Results:
3
4
4
4
3
========= ERROR SUMMARY: 0 errors
$
Notes:
1. It's unclear what you intend for r_d, p_d, and Ap_d in the cudaAlloc routine. I've left them as-is. But if you intend to use them for something, they will likely be subject to the issue I describe in item 1 above.
2. As mentioned, your code doesn't seem to be consistent between float and double in the parameters you pass to the cusparse routines in HPC_sparsemv. In particular, the column index parameter does not match between the two calls; the double version seems sensible to me, so I used that. If you work with float, you will probably need to modify that parameter.
3. In the future, I'd recommend that you provide a complete code, just as I have shown, to demonstrate the failure. It's not much more code than what you have shown already, and it makes it easier for others to help you.