What is the purpose of using multiple "arch" flags in Nvidia's NVCC compiler?

Question

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.

From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.

I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.

After inspecting various CUDA project Makefiles, I've noticed the following pattern occurring regularly:

-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21

After some reading, I found that a single binary file can contain code for multiple device architectures, in this case sm_20 and sm_21.

My questions are: why are so many arch/code pairs necessary? Are all of the "arch" values above actually used?

What is the difference between that and specifying:

-arch compute_20
-code sm_20
-code sm_21

Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?

Is there any other compilation and runtime behaviour I should be aware of?

I've read the manual (http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation), but I'm still not clear about what happens at compile time or at runtime.

Recommended answer

Roughly speaking, the code compilation flow goes like this:

CUDA C/C++ device code source --> PTX --> SASS

The virtual architecture (e.g. compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
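As a rough illustration of the arch/code split described above, the Python sketch below models which images a list of -gencode pairs would place in the executable. It is a simplified model, not nvcc's implementation: the helper names cc and embedded_images are hypothetical, each pair is assumed to carry exactly one code value, and the real-target check encodes the rule that SASS is generated from the arch PTX, so the real target should not be lower than the virtual one.

```python
from typing import List, Set, Tuple


def cc(name: str) -> int:
    """Hypothetical helper: 'compute_20' -> 20, 'sm_21' -> 21."""
    return int(name.split("_", 1)[1])


def embedded_images(gencode_specs: List[str]) -> Set[Tuple[str, str, str]]:
    """Simplified model of what 'arch=...,code=...' pairs embed.

    Returns (kind, target, generated_from) tuples, where kind is "SASS"
    (machine code for a real GPU) or "PTX" (code the driver can JIT-compile).
    Assumes exactly one code value per pair.
    """
    images = set()
    for spec in gencode_specs:
        fields = dict(part.split("=", 1) for part in spec.split(","))
        arch, code = fields["arch"], fields["code"]
        if not arch.startswith("compute_"):
            raise ValueError(f"arch must be a virtual architecture: {arch}")
        if code.startswith("sm_"):
            # SASS is produced from the arch PTX, so the real target may not be lower.
            if cc(code) < cc(arch):
                raise ValueError(f"{code} is lower than virtual architecture {arch}")
            images.add(("SASS", code, arch))
        elif code.startswith("compute_"):
            # A virtual architecture as the code value embeds PTX for later JIT.
            images.add(("PTX", code, arch))
        else:
            raise ValueError(f"unrecognised code value: {code}")
    return images


# The three pairs from the question's Makefiles.
question_flags = [
    "arch=compute_20,code=sm_20",
    "arch=compute_20,code=sm_21",
    "arch=compute_21,code=sm_21",
]
for image in sorted(embedded_images(question_flags)):
    print(image)
```

Running it prints three SASS tuples: ('SASS', 'sm_20', 'compute_20'), ('SASS', 'sm_21', 'compute_20') and ('SASS', 'sm_21', 'compute_21'). In this model both virtual architectures are used, but only as the source for SASS; none of the three pairs embeds PTX, so a binary built with just these flags gives the driver nothing to JIT-compile on a GPU that none of the SASS images matches.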

As you point out, one of the handy features of GPU operation is JIT compilation. JIT compilation is done by the GPU driver (it does not require the CUDA toolkit to be installed) whenever suitable PTX code is available but suitable SASS code is not. "Suitable" PTX code is PTX whose virtual architecture is numerically equal to or lower than the architecture of the GPU the code will run on. To pick an example, specifying arch=compute_30,code=compute_30 tells nvcc to embed cc 3.0 PTX code in the executable. This PTX code can be used to generate SASS code for any later architecture that the GPU driver supports. At the time of writing, this would include architectures such as Pascal, Volta and Turing, assuming the installed GPU driver supports them.
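The selection rule can be sketched in the same way. The Python below is a simplified model, not the driver's actual algorithm: select_image is a hypothetical name, compute capabilities are written as integers (61 for cc 6.1), and it assumes the usual binary-compatibility rule that SASS built for sm_XY runs only on devices with the same major version X and a minor version of at least Y. When no SASS image fits, it falls back to the highest embedded PTX that is numerically at or below the device, which the driver would JIT-compile.

```python
from typing import List, Optional, Tuple


def cc(name: str) -> int:
    """Hypothetical helper: 'compute_30' -> 30, 'sm_61' -> 61."""
    return int(name.split("_", 1)[1])


def select_image(device_cc: int, sass_targets: List[str],
                 ptx_targets: List[str]) -> Optional[Tuple[str, str]]:
    """Simplified model of picking an image for a device (e.g. 61 for cc 6.1).

    Returns ("SASS", target) when compatible machine code is embedded,
    ("JIT", target) when the driver must compile PTX, or None if nothing fits.
    """
    major, minor = divmod(device_cc, 10)
    # SASS for sm_XY: same major version, device minor version >= Y.
    native = [t for t in sass_targets
              if cc(t) // 10 == major and cc(t) % 10 <= minor]
    if native:
        return ("SASS", max(native, key=cc))
    # "Suitable" PTX: virtual architecture numerically <= the device.
    suitable = [t for t in ptx_targets if cc(t) <= device_cc]
    if suitable:
        return ("JIT", max(suitable, key=cc))
    return None  # no usable image: the kernel cannot run on this device


# Binary built with arch=compute_30,code=sm_30 and arch=compute_30,code=compute_30.
for device in (30, 35, 61, 75, 20):
    print(device, select_image(device, ["sm_30"], ["compute_30"]))
```

With those two images, the model reports native SASS for cc 3.0 and 3.5, JIT compilation from compute_30 for cc 6.1 and 7.5, and no usable image for cc 2.0. Dropping the PTX entry would turn the cc 6.1 and 7.5 results into None as well, which is one way to end up with an executable that won't run on a particular GPU.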

One advantage of including multiple virtual architectures (i.e. multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).

One advantage of including multiple "real GPU targets" (i.e. multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.

If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.

One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.

It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.

To embed PTX suitable for JIT compilation, specify a virtual architecture for the code switch (for example code=compute_30, as above).
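A build script can assemble such option lists mechanically. The Python helper below is a hypothetical sketch, not part of any NVIDIA tool: for each real target sm_XY it emits a -gencode pair using the matching virtual architecture compute_XY (one common convention; as the question's Makefiles show, a lower virtual architecture is also valid), and it optionally appends a code=compute_XY pair so that PTX is embedded for JIT compilation. The result is an argument list that could be passed to nvcc, for example through subprocess.

```python
from typing import List, Optional


def gencode_flags(sass_targets: List[str], ptx_target: Optional[str] = None) -> List[str]:
    """Hypothetical helper: build nvcc '-gencode' arguments.

    sass_targets: real architectures to embed SASS for, e.g. ["sm_52", "sm_70"].
    ptx_target:   virtual architecture to embed PTX for, e.g. "compute_70";
                  None yields a binary with no PTX (and so no JIT fallback).
    """
    args: List[str] = []
    for sm in sass_targets:
        if not sm.startswith("sm_"):
            raise ValueError(f"expected a real architecture (sm_XY): {sm}")
        virtual = "compute_" + sm[len("sm_"):]  # convention: matching virtual arch
        args += ["-gencode", f"arch={virtual},code={sm}"]
    if ptx_target is not None:
        if not ptx_target.startswith("compute_"):
            raise ValueError(f"expected a virtual architecture (compute_XY): {ptx_target}")
        # A virtual architecture as the code value embeds PTX for the driver to JIT.
        args += ["-gencode", f"arch={ptx_target},code={ptx_target}"]
    return args


print(" ".join(gencode_flags(["sm_52", "sm_70"], "compute_70")))
```

This prints -gencode arch=compute_52,code=sm_52 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70. Embedding PTX for the highest listed architecture keeps a JIT path open for newer GPUs, while omitting ptx_target gives the PTX-free kind of executable mentioned above, at the cost of that fallback.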
