CUDA 5.0: CUBIN and CUBLAS_device, compute capability 3.5


Problem description


I'm trying to compile a kernel that uses dynamic parallelism to run CUBLAS to a cubin file. When I try to compile the code using the command

nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu

I get:

ptxas fatal : Unresolved extern function 'cublasCreate_v2'


If I add the -rdc=true compile option it compiles fine, but when I try to load the module using cuModuleLoad I get error 500: CUDA_ERROR_NOT_FOUND. From cuda.h:

/**
 * This indicates that a named symbol was not found. Examples of symbols
 * are global/constant variable names, texture names, and surface names.
 */
CUDA_ERROR_NOT_FOUND                      = 500,

Kernel code:

#include <stdio.h>
#include <cublas_v2.h>
extern "C" {
__global__ void a() {
    cublasHandle_t cb_handle = NULL;
    cudaStream_t stream;
    if( threadIdx.x == 0 ) {
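        // Thread 0 creates the device-side CUBLAS handle and a stream for it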
        cublasStatus_t status = cublasCreate_v2(&cb_handle);
        cublasSetPointerMode_v2(cb_handle, CUBLAS_POINTER_MODE_HOST);
        if (status != CUBLAS_STATUS_SUCCESS) {
            return;
        }
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        cublasSetStream_v2(cb_handle, stream);
    }
    __syncthreads();
    int jp;
    double A[3];
    A[0] = 4.0f;
    A[1] = 5.0f;
    A[2] = 6.0f;
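    // Device-side (child) CUBLAS call via dynamic parallelism; A is local, see NOTE below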
    cublasIdamax_v2(cb_handle, 3, A, 1, &jp );
}
}


NOTE: The scope of A is local, so the data at the pointer given to cublasIdamax_v2 is undefined, and so jp ends up as a more or less random value in this code. The correct way to do it would be to have A in global memory.
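
A minimal sketch of that fix, assuming a __device__ global buffer is acceptable (the buffer name d_A and the kernel name a_global below are illustrative, not from the original post):

#include <cublas_v2.h>

__device__ double d_A[3];   // hypothetical global buffer replacing the local array

extern "C" __global__ void a_global() {
    cublasHandle_t cb_handle = NULL;
    if (threadIdx.x == 0) {
        // Handle setup and the CUBLAS call are done by thread 0 only
        if (cublasCreate_v2(&cb_handle) != CUBLAS_STATUS_SUCCESS) return;
        cublasSetPointerMode_v2(cb_handle, CUBLAS_POINTER_MODE_HOST);

        d_A[0] = 4.0; d_A[1] = 5.0; d_A[2] = 6.0;

        int jp;
        // The pointer now refers to global memory, so the child CUBLAS kernel sees defined data
        cublasIdamax_v2(cb_handle, 3, d_A, 1, &jp);
        cublasDestroy_v2(cb_handle);
    }
}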


Host code:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

int main() {
    CUresult error;
    CUdevice cuDevice;
    CUcontext cuContext;
    CUmodule cuModule;
    CUfunction testkernel;
    // Initialize
    error = cuInit(0);
    if (error != CUDA_SUCCESS) printf("ERROR: cuInit, %i\n", error);
    error = cuDeviceGet(&cuDevice, 0);
    if (error != CUDA_SUCCESS) printf("ERROR: cuInit, %i\n", error);
    error = cuCtxCreate(&cuContext, 0, cuDevice);
    if (error != CUDA_SUCCESS) printf("ERROR: cuCtxCreate, %i\n", error);
    error = cuModuleLoad(&cuModule, "test.cubin");
    if (error != CUDA_SUCCESS) printf("ERROR: cuModuleLoad, %i\n", error);
    error = cuModuleGetFunction(&testkernel, cuModule, "a");
    if (error != CUDA_SUCCESS) printf("ERROR: cuModuleGetFunction, %i\n", error);
    return 0;
}
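
The host code stops after cuModuleGetFunction. As a hedged sketch (not part of the original post), the retrieved function could then be launched through the driver API like this; the launch configuration is an assumption:

    // Sketch only: launch the kernel "a" retrieved above; it takes no parameters.
    error = cuLaunchKernel(testkernel,
                           1, 1, 1,        // grid dimensions
                           32, 1, 1,       // block dimensions (assumed)
                           0, NULL,        // shared memory bytes, default stream
                           NULL, NULL);    // no kernel parameters, no extra options
    if (error != CUDA_SUCCESS) printf("ERROR: cuLaunchKernel, %i\n", error);
    error = cuCtxSynchronize();
    if (error != CUDA_SUCCESS) printf("ERROR: cuCtxSynchronize, %i\n", error);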


The host code is compiled using nvcc -lcuda test.cpp. If I replace the kernel with a simple kernel (below) and compile it without -rdc=true, it works fine.

Simple working kernel:

#include <stdio.h>
extern "C" {
__global__ void a() {
    printf("hello\n");
}
}
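
The command for that working build is not shown; presumably it is the first command without -rdc=true and without the device libraries, along the lines of:

nvcc -cubin -m64 -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu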

Thanks in advance


  • Soren

Answer


You are just missing -dlink in your first approach: the device-side CUBLAS entry points (such as cublasCreate_v2) have to be resolved by a device-link step against the cublas_device library before a loadable cubin is produced:

nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu -dlink


You can also do that in two steps:

nvcc -m64 test.cu -gencode arch=compute_35,code=sm_35 -o test.o -dc
nvcc -dlink test.o -arch sm_35 -lcublas_device -lcudadevrt -cubin -o test.cubin
