How to compile PTX code


Problem description

I need to modify the PTX code and compile it directly. The reason is that I want certain instructions to appear immediately after one another, and it is difficult to write CUDA code that produces my target PTX, so I need to modify the PTX directly. The problem is that I can compile it to .fatbin and .cubin files, but I don't know how to compile those (.fatbin and .cubin) into an "X.o" file.

Thanks for your help in advance.

Solution

There may be a way to do this with an orderly sequence of nvcc commands, but I'm not aware of it and haven't discovered it.

One possible approach, albeit a messy one, is to interrupt and restart the CUDA compilation sequence, editing the PTX file in the interim (before the restart). This is based on information provided in the nvcc manual. I would not consider it a standard methodology, so your mileage may vary; there may be any number of scenarios I haven't considered where it doesn't work or isn't feasible.

In order to explain this I shall present an example code:

#include <stdio.h>

__global__ void mykernel(int *data){

  (*data)++;
}

int main(){

  int *d_data, h_data = 0;
  cudaMalloc((void **)&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  return 0;
}

For this purpose, I am dispensing with CUDA error checking and other niceties in favor of brevity.

Ordinarily we might compile the above code as follows:

nvcc -arch=sm_20 -o t266 t266.cu 

(assuming the source file is named t266.cu)

Instead, based on the reference manual, we'll compile as follows:

nvcc -arch=sm_20 -keep -o t266 t266.cu

This will build the executable but keep all intermediate files, including t266.ptx (which contains the PTX code for mykernel).
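
To find the kernel inside the kept PTX file, it can help to list its entry points first. The following is a minimal sketch using grep; the function name list_ptx_kernels is my own and not part of any CUDA tool. It relies on PTX declaring each kernel with the .entry directive.

```shell
# List kernel entry points in a PTX file, with line numbers.
# list_ptx_kernels is a hypothetical helper name chosen for this example.
list_ptx_kernels() {
    grep -n '\.entry' "$1"
}
```

Called as list_ptx_kernels t266.ptx, this should print the line that declares the kernel under its C++-mangled name (for mykernel(int *) that is _Z8mykernelPi). The instructions to edit follow that line.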

If we simply ran the executable at this point, we'd get output like this:

$ ./t266
data = 1
$

The next step will be to edit the ptx file to make whatever changes we want. In this case, we'll have the kernel add 2 to the data variable instead of adding 1. The relevant line is:

    add.s32         %r2, %r1, 2;
                              ^
                              |
                          change the 1 to a 2 here
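
The same edit can be made non-interactively. This is only a sketch, assuming the instruction appears exactly as shown above (destination %r2, source %r1, immediate 1); the register names in your PTX may differ, and patch_increment is a hypothetical helper name.

```shell
# Change the immediate operand of "add.s32 %r2, %r1, 1;" from 1 to 2.
# A copy of the unmodified file is kept as <file>.orig.
# Assumption: the register names match this example; adjust them for your PTX.
patch_increment() {
    cp "$1" "$1.orig"
    sed -E '/add\.s32[[:space:]]+%r2,[[:space:]]*%r1,/ s/,[[:space:]]*1;[[:space:]]*$/, 2;/' \
        "$1.orig" > "$1"
}
```

Running patch_increment t266.ptx rewrites only lines that match the address pattern, so other add.s32 instructions are left alone. Comparing t266.ptx against t266.ptx.orig with diff is a quick way to confirm that exactly one line changed.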

Now comes the messy part. The next step is to capture all the intermediate compile commands, so we can rerun some of them:

nvcc -dryrun -arch=sm_20 -o t266 t266.cu --keep 2>dryrun.out

(Using Linux redirection of stderr here.) We then want to edit the dryrun.out file so that:

  1. we retain all the commands after the creation of the ptx file, up to the end of the file. The line that creates the ptx file will be evident as the one which specifies -o "t266.ptx"
  2. we strip out the leading #$ that each line begins with, so in effect we are creating a script.
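
The two editing steps above can also be done mechanically. The following is a minimal sketch with awk, assuming the PTX-producing line contains the literal text -o "t266.ptx" and that every command line starts with "#$ " as in the nvcc dry-run output; make_rebuild_script is a hypothetical helper name.

```shell
# Print every command that follows the line creating t266.ptx,
# with the leading "#$ " removed, so the output can be used as a script.
# Assumption: the marker text below matches your dry-run output.
make_rebuild_script() {
    awk -v marker='-o "t266.ptx"' '
        found { sub(/^#[$] /, ""); print }   # after the marker: strip prefix, emit
        index($0, marker) { found = 1 }      # the marker line itself is not emitted
    ' "$1"
}
```

For example, make_rebuild_script dryrun.out > rebuild.sh writes the trimmed command list to a separate file and leaves dryrun.out untouched, which makes it easy to regenerate the script if the marker needs adjusting.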

When I perform the above 2 steps, I end up with a script like this:

ptxas  -arch=sm_20 -m64  "t266.ptx"  -o "t266.sm_20.cubin"
fatbinary --create="t266.fatbin" -64 --key="xxxxxxxxxx" --ident="t266.cu" "--image=profile=sm_20,file=t266.sm_20.cubin" "--image=profile=compute_20,file=t266.ptx" --embedded-fatbin="t266.fatbin.c" --cuda
gcc -D__CUDA_ARCH__=200 -E -x c++   -DCUDA_DOUBLE_MATH_FUNCTIONS   -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266.cu.cpp.ii" "t266.cudafe1.cpp"
gcc -c -x c++ "-I/usr/local/cuda/bin/..//include"   -fpreprocessed -m64 -o "t266.o" "t266.cu.cpp.ii"
nvlink --arch=sm_20 --register-link-binaries="t266_dlink.reg.c" -m64   "-L/usr/local/cuda/bin/..//lib64" "t266.o"  -o "t266_dlink.sm_20.cubin"
fatbinary --create="t266_dlink.fatbin" -64 --key="t266_dlink" --ident="t266.cu " -link "--image=profile=sm_20,file=t266_dlink.sm_20.cubin" --embedded-fatbin="t266_dlink.fatbin.c"
gcc -c -x c++ -DFATBINFILE="\"t266_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"t266_dlink.reg.c\"" -I. "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266_dlink.o" "/usr/local/cuda/bin/crt/link.stub"
g++ -m64 -o "t266" -Wl,--start-group "t266_dlink.o" "t266.o"   "-L/usr/local/cuda/bin/..//lib64" -lcudart_static  -lrt -lpthread -ldl  -Wl,--end-group

Finally, execute the above script. (On Linux you can make the script file executable using chmod +x dryrun.out or similar.) If you haven't made any mistakes while editing the .ptx file, the commands should all complete successfully and create a new t266 executable.
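
A plain shell script keeps going after a failed command, so if ptxas rejects an edited instruction, the later steps could still link stale intermediate files. As a hedged sketch, the script can instead be run with the shell's -e option (stop at the first failing command) and -x option (echo each command as it runs); run_rebuild is a hypothetical helper name.

```shell
# Run the trimmed command list, echoing each command and
# stopping at the first one that exits with a non-zero status.
run_rebuild() {
    sh -ex "$1"
}
```

With this, run_rebuild dryrun.out returns a non-zero status as soon as any step fails, and the last command echoed to stderr is the one to investigate.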

When we run that file, we observe:

$ ./t266
data = 2
$

This indicates that our changes were successful.
