cuBLAS argmin -- segfault if outputting to device memory?


Problem description

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array (more precisely, the 1-based index of the element with the smallest absolute value).

Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result)

The cuBLAS programmer guide provides this information about the cublasIsamin() parameters:

If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example:

void argmin_experiment_hostOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    int result; //host memory
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, &result));
    //cuBLAS returns a 1-based index, so subtract 1 before indexing the C array
    printf("argmin = %d, min = %f \n", result - 1, h_A[result - 1]);

    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDART(cudaFree(d_A));
}

However, if I use device (GPU) memory for result, then cublasIsamin segfaults. Here's an example that segfaults:

void argmin_experiment_deviceOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    int* d_result = 0; 
    CHECK_CUDART(cudaMalloc((void**)&d_result, 1 * sizeof(d_result[0]))); //just enough device memory for 1 result
    CHECK_CUDART(cudaMemset(d_result, 0, 1 * sizeof(d_result[0])));
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, d_result)); //SEGFAULT!

    CHECK_CUBLAS(cublasDestroy(handle));
}


The Nvidia guide says that `cublasIsamin()` can output to device memory. What am I doing wrong?


Motivation: I want to compute the argmin() of several vectors concurrently in multiple streams. Outputting to host memory requires CPU-GPU synchronization and seems to kill the multi-kernel concurrency. So, I want to output the argmin to device memory instead.

Solution

The CUBLAS V2 API does support writing scalar results to device memory, but not by default. In the default host pointer mode, the library treats result as a host address, which is why passing a device pointer segfaults. As per Section 2.4 "Scalar parameters" of the documentation, you need to call cublasSetPointerMode() with CUBLAS_POINTER_MODE_DEVICE to make the API aware that scalar argument pointers will reside in device memory. Note that this also makes these level 1 BLAS functions asynchronous, so you must ensure that the GPU has completed the kernel(s) before trying to access the result pointer.
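As a minimal sketch of that fix, the failing example from the question can be adapted as follows. The function name argmin_device_pointer_mode and the two CHECK_ macros are hypothetical stand-ins (the question does not show how CHECK_CUDART and CHECK_CUBLAS are defined); the cuBLAS and CUDA runtime calls themselves are the standard ones.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical error-checking macros, standing in for the CHECK_CUDART and
// CHECK_CUBLAS helpers used in the question (their definitions are not shown).
#define CHECK_CUDART(call)                                                  \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));  \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

#define CHECK_CUBLAS(call)                                                  \
    do {                                                                    \
        cublasStatus_t st_ = (call);                                        \
        if (st_ != CUBLAS_STATUS_SUCCESS) {                                 \
            fprintf(stderr, "cuBLAS error: status %d\n", (int)st_);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Hypothetical helper: runs cublasIsamin with the result written to device
// memory and returns the 1-based index that cuBLAS reports.
int argmin_device_pointer_mode() {
    float h_A[4] = {1, 2, 3, 4}; int N = 4;
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    // Tell cuBLAS that scalar arguments (here: result) are device pointers.
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));

    int* d_result = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_result, sizeof(d_result[0])));
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, d_result)); // no segfault now

    // The call above is asynchronous in device pointer mode, so wait for the
    // GPU to finish before reading the result back.
    CHECK_CUDART(cudaDeviceSynchronize());
    int h_result = 0;
    CHECK_CUDART(cudaMemcpy(&h_result, d_result, sizeof(h_result),
                            cudaMemcpyDeviceToHost));

    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDART(cudaFree(d_result));
    CHECK_CUDART(cudaFree(d_A));
    return h_result; // 1-based; use h_result - 1 to index a C array
}
```

Calling argmin_device_pointer_mode() from main() and printing the return value is enough to try it out. The copy back to the host is only there to inspect the result; in the multi-stream scenario from the question, d_result would normally stay on the device and be consumed by later kernels in the same stream.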

See this answer (http://stackoverflow.com/a/12401838/681865) for a complete working example.
