cuBLAS argmin -- segfault if outputting to device memory?
Question
In cuBLAS, `cublasIsamin()` gives the argmin for a single-precision array.
Here's the full function declaration:
cublasStatus_t cublasIsamin(cublasHandle_t handle, int n,
                            const float *x, int incx, int *result)
The cuBLAS programmer guide documents the `cublasIsamin()` parameters and, as I read it, allows `result` to reside in either host or device memory.
If I use host (CPU) memory for `result`, then `cublasIsamin` works properly. Here's an example:
void argmin_experiment_hostOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));
    int result; //host memory
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, &result));
    //cuBLAS reports a 1-based index, so subtract 1 to index the C array
    printf("argmin = %d, min = %f\n", result, h_A[result - 1]);
    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDART(cudaFree(d_A));
}
However, if I use device (GPU) memory for `result`, then `cublasIsamin` segfaults. Here's an example that segfaults:
void argmin_experiment_deviceOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));
    int* d_result = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_result, 1 * sizeof(d_result[0]))); //just enough device memory for 1 result
    CHECK_CUDART(cudaMemset(d_result, 0, 1 * sizeof(d_result[0])));
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, d_result)); //SEGFAULT!
    CHECK_CUBLAS(cublasDestroy(handle));
}
The Nvidia guide says that `cublasIsamin()` can output to device memory. What am I doing wrong?
Motivation: I want to compute the argmin() of several vectors concurrently in multiple streams. Outputting to host memory requires CPU-GPU synchronization and seems to kill the multi-kernel concurrency. So, I want to output the argmin to device memory instead.
Recommended answer
The CUBLAS V2 API does support writing scalar results to device memory, but not by default. As per Section 2.4 "Scalar parameters" of the documentation, you need to call `cublasSetPointerMode()` with `CUBLAS_POINTER_MODE_DEVICE` to make the API aware that scalar argument pointers will reside in device memory. In the default host pointer mode the library treats `result` as a host address, which would explain the segfault when it is handed a device pointer. Note that device pointer mode also makes these level 1 BLAS functions asynchronous, so you must ensure that the GPU has completed the kernel(s) before trying to access the result pointer.
See this answer for a complete working example.