Compiling CUDA with dynamic parallelism fallback - multiple architectures/compute capability

Problem description

In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. How should I compile them so that, on a GPU without dynamic parallelism support, I can either provide a fallback option or simply let the application continue with reduced/partially available features?

At the moment I'm getting an "invalid device function" error when running kernels that don't require compute 3.5, compiled with -arch=sm_35, on a GTX 670 (max sm_30).

AFAIK you can't use multiple -arch=sm_* arguments, and using multiple -gencode=* arguments doesn't help. Also, for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when compiling for compute 3.0: the link fails with "nvlink fatal : no candidate found in fatbinary" because of -lcudadevrt, which I've needed for 3.5. How should I deal with this?

Recommended answer

I believe this issue has now been addressed in CUDA 6.

Simple test:

$ cat t264.cu
#include <stdio.h>

__global__ void kernel1(){
  printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){

#if __CUDA_ARCH__ >= 350
  kernel1<<<1,1>>>();
#else
  printf("Hello from non-DP Kernel\n");
#endif
}

int main(){

  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$



In my case, device 0 is a Quadro 5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
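
The #if __CUDA_ARCH__ >= 350 guard in the test makes the choice inside device code at compile time, once per target compiled into the fat binary. If host code also needs to know whether dynamic parallelism is available (for example, to switch off features that depend on it), a minimal sketch is to compare the device's compute capability against the same threshold. The helper names below are hypothetical. The sketch assumes major/minor are taken from the cudaDeviceProp fields of the same name after calling cudaGetDeviceProperties, and relies on nvcc defining __CUDA_ARCH__ as major*100 + minor*10 (200 for cc 2.0, 350 for cc 3.5).

```cpp
#include <cstdio>

// Hypothetical helper: the value nvcc assigns to __CUDA_ARCH__ when compiling
// device code for compute capability major.minor (2.0 -> 200, 3.5 -> 350).
constexpr int cudaArchValue(int major, int minor) {
    return major * 100 + minor * 10;
}

// Hypothetical helper: dynamic parallelism needs compute capability 3.5 or
// higher, the same threshold as the `#if __CUDA_ARCH__ >= 350` guard.
constexpr bool supportsDynamicParallelism(int major, int minor) {
    return cudaArchValue(major, minor) >= 350;
}

// Assumption: in a real program, major/minor come from cudaDeviceProp::major
// and cudaDeviceProp::minor, filled in by cudaGetDeviceProperties(&prop, dev).
void reportDevice(const char* name, int major, int minor) {
    std::printf("%s (cc %d.%d): %s\n", name, major, minor,
                supportsDynamicParallelism(major, minor)
                    ? "dynamic parallelism available"
                    : "falling back to non-DP code path");
}
```

For the two devices above, reportDevice("Quadro 5000", 2, 0) takes the fallback branch, while reportDevice("GeForce GT 640", 3, 5) reports dynamic parallelism as available. The check is plain host C++, so it can live in a .cu file next to the kernels or in an ordinary .cpp file.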
