Compiling CUDA with dynamic parallelism fallback - multiple architectures/compute capability
Question
In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. For the purposes of either providing a fallback option if this is not supported, or simply allowing the application to continue but with reduced/partially available features, how can I go about compiling?
At the moment I'm getting invalid device function when running kernels compiled with -arch=sm_35 on a 670 (max sm_30) that don't require compute 3.5.
AFAIK you can't use multiple -arch=sm_* arguments, and using multiple -gencode=* doesn't help. Also, for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when targeting compute 3.0 (nvlink fatal : no candidate found in fatbinary, due to -lcudadevrt, which I've needed for 3.5). How should I deal with this?
Answer
I believe this issue has now been addressed in CUDA 6.
A simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
printf("Hello from DP Kernel\n");
}
__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
kernel1<<<1,1>>>();
#else
printf("Hello from non-DP Kernel\n");
#endif
}
int main(){
kernel2<<<1,1>>>();
cudaDeviceSynchronize();
return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
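As a complement to the compile-time `__CUDA_ARCH__` guard in the test above, the host can also query the device's compute capability at runtime and choose which kernel to launch. The sketch below is illustrative and not from the original answer; the kernel names (parent, child, flat) are made up for the example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child() { printf("Hello from DP child\n"); }

__global__ void parent() {
#if __CUDA_ARCH__ >= 350
    child<<<1,1>>>();   // dynamic parallelism path, only compiled for cc >= 3.5
#endif
}

__global__ void flat() { printf("Hello from non-DP kernel\n"); }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Dynamic parallelism requires compute capability 3.5 or higher
    bool has_dp = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);
    if (has_dp)
        parent<<<1,1>>>();
    else
        flat<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

This would still be built the same way as the test program, e.g. with -rdc=true, multiple -gencode entries, and -lcudadevrt, so the fatbinary contains code for both architectures and the host picks the appropriate kernel per device.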