How to compile PTX code

Question

I need to modify PTX code and compile it directly. I want certain instructions to appear right after one another, and it is difficult to write CUDA code that produces my target PTX, so I need to edit the PTX itself. The problem is that I can compile it to .fatbin and .cubin files, but I don't know how to compile those into an "X.o" file.

Recommended answer

There may be a way to do this with an orderly sequence of nvcc commands, but I'm not aware of it and haven't discovered it.

One possible approach, albeit a messy one, is to interrupt and restart the CUDA compilation sequence, editing the ptx file in the interim (before the restart). This is based on information provided in the nvcc manual, and I would not consider it a standard methodology, so your mileage may vary. There may be any number of scenarios I haven't considered where this doesn't work or isn't feasible.

To explain this, I'll present some example code:

#include <stdio.h>

__global__ void mykernel(int *data){

  (*data)++;
}

int main(){

  int *d_data, h_data = 0;
  cudaMalloc((void **)&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  return 0;
}

For this purpose, I am dispensing with cuda error checking and other niceties, in favor of brevity.

Ordinarily we might compile the above code as follows:

nvcc -arch=sm_20 -o t266 t266.cu 

(assuming the source file is named t266.cu)

Instead, based on the reference manual, we'll compile as follows:

nvcc -arch=sm_20 -keep -o t266 t266.cu

This builds the executable but keeps all intermediate files, including t266.ptx (which contains the ptx code for mykernel).

If we simply ran the executable at this point, we'd get output like this:

$ ./t266
data = 1
$

The next step will be to edit the ptx file to make whatever changes we want. In this case, we'll have the kernel add 2 to the data variable instead of adding 1. The relevant line is:

    add.s32         %r2, %r1, 2;
                              ^
                              |
                          change the 1 to a 2 here
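
For a one-line change like this, the edit can also be scripted rather than done by hand. The following is only a sketch: it works on a stand-in file (sample.ptx, a hypothetical name) holding a single instruction, and it assumes the generated line ends in ", 1;" exactly as shown. The PTX your compiler emits may use different registers or spacing, so check the real t266.ptx before adapting the pattern.

```shell
# Stand-in for the kept t266.ptx: only the one instruction of interest.
# (Assumption: the generated line has this form; inspect your own file.)
printf '\tadd.s32 \t%%r2, %%r1, 1;\n' > sample.ptx

# On lines containing add.s32, change a trailing ", 1;" to ", 2;".
sed '/add\.s32/s/, 1;$/, 2;/' sample.ptx > sample_edited.ptx

# Show the edited instruction.
cat sample_edited.ptx
```

Writing to a second file keeps the original around for comparison; with a real kernel that has several add.s32 instructions, the address pattern would need to be narrowed to the specific line.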

Now comes the messy part. The next step is to capture all the intermediate compile commands, so we can rerun some of them:

nvcc -dryrun -arch=sm_20 -o t266 t266.cu --keep 2>dryrun.out

(This uses Linux redirection of stderr.) We then want to edit that dryrun.out file so that:

  1. we retain all the commands after the one that creates the ptx file, up to the end of the file. The line that creates the ptx file will be evident as the one which specifies -o "t266.ptx"
  2. we strip out the leading #$ that each line begins with, so in effect we are creating a script.
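
The two editing steps above can be sketched with sed instead of a text editor. This is an illustration under assumptions: dryrun_sample.out and rebuild.sh are hypothetical names, and the sample lines are heavily shortened stand-ins for real dryrun output, whose exact contents depend on the CUDA version.

```shell
# Shortened stand-in for a captured dryrun.out; real lines are much longer.
cat > dryrun_sample.out <<'EOF'
#$ _HERE_=/usr/local/cuda/bin
#$ cicc -arch compute_20 "t266.cpp1.ii" -o "t266.ptx"
#$ ptxas -arch=sm_20 -m64 "t266.ptx" -o "t266.sm_20.cubin"
#$ g++ -m64 -o "t266" "t266_dlink.o" "t266.o"
EOF

# Step 1: delete everything up to and including the line that writes t266.ptx.
# Step 2: strip the leading "#$ " from each remaining line.
sed -e '1,/-o "t266\.ptx"/d' -e 's/^#\$ //' dryrun_sample.out > rebuild.sh

# Show the resulting script.
cat rebuild.sh
```

Writing the result to a separate file leaves the captured dryrun.out untouched, so the filter can be rerun if the pattern needs adjusting.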

When I perform the above 2 steps, I end up with a script like this:

ptxas  -arch=sm_20 -m64  "t266.ptx"  -o "t266.sm_20.cubin"
fatbinary --create="t266.fatbin" -64 --key="xxxxxxxxxx" --ident="t266.cu" "--image=profile=sm_20,file=t266.sm_20.cubin" "--image=profile=compute_20,file=t266.ptx" --embedded-fatbin="t266.fatbin.c" --cuda
gcc -D__CUDA_ARCH__=200 -E -x c++   -DCUDA_DOUBLE_MATH_FUNCTIONS   -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266.cu.cpp.ii" "t266.cudafe1.cpp"
gcc -c -x c++ "-I/usr/local/cuda/bin/..//include"   -fpreprocessed -m64 -o "t266.o" "t266.cu.cpp.ii"
nvlink --arch=sm_20 --register-link-binaries="t266_dlink.reg.c" -m64   "-L/usr/local/cuda/bin/..//lib64" "t266.o"  -o "t266_dlink.sm_20.cubin"
fatbinary --create="t266_dlink.fatbin" -64 --key="t266_dlink" --ident="t266.cu " -link "--image=profile=sm_20,file=t266_dlink.sm_20.cubin" --embedded-fatbin="t266_dlink.fatbin.c"
gcc -c -x c++ -DFATBINFILE="\"t266_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"t266_dlink.reg.c\"" -I. "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266_dlink.o" "/usr/local/cuda/bin/crt/link.stub"
g++ -m64 -o "t266" -Wl,--start-group "t266_dlink.o" "t266.o"   "-L/usr/local/cuda/bin/..//lib64" -lcudart_static  -lrt -lpthread -ldl  -Wl,--end-group

Finally, execute the above script. (On Linux, you can make the script file executable using chmod +x dryrun.out or similar.) If you haven't made any mistakes while editing the .ptx file, the commands should all complete successfully and create a new t266 executable file.

When we run that file, we observe:

$ ./t266
data = 2
$

Indicating that our changes were successful.
