How to generate, compile and run CUDA kernels at runtime


Question

Well, I have quite a delicate question :)

Let's start with what I have:


  1. Data, a large array of data, copied to the GPU
  2. A program, generated by the CPU (host), which needs to be evaluated for every element of that array
  3. The program changes very frequently; it can be generated as a CUDA string, a PTX string or something else (?) and needs to be re-evaluated after each change

What I want: Basically, I just want to make this as effective (fast) as possible, e.g. avoid compiling CUDA to PTX. The solution can even be completely device-specific; no great compatibility is required here :)

What I know: I already know the function cuModuleLoad, which can load and create a kernel from PTX code stored in a file. But I think there must be some other way to create a kernel directly, without saving it to a file first. Or perhaps it may be possible to store it as bytecode?

My question: How would you do this? Could you post an example or a link to a website covering a similar topic? TY


EDIT: OK, a PTX kernel can now be run from a PTX string (char array) directly. Anyway, I still wonder: is there some better/faster solution to this? There is still the conversion from a string to some PTX bytecode, which should be avoided if possible. I also suspect that some clever way of creating a device-specific CUDA binary from PTX might exist, which would remove the JIT compiler lag (it is small, but it can add up if you have huge numbers of kernels to run) :)
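For reference, one way to get the device-specific binary speculated about above is to compile the PTX offline with ptxas and load the resulting cubin, which the driver accepts without a JIT step. A minimal sketch, assuming a CUDA context already exists; the file names and the sm_35 architecture are placeholders:

#include <cassert>
#include <cstdlib>
#include <string>
#include <cuda.h>

// Sketch: assumes cuInit/cuCtxCreate have already been called. sm_35 is a
// placeholder; use the compute capability of the actual target device.
CUmodule load_module_without_jit(const std::string& ptx_filename,
                                 const std::string& cubin_filename) {
    // Compile the PTX to a device-specific cubin offline with ptxas.
    int ptxas_exit_status = system(
        ("ptxas -arch=sm_35 " + ptx_filename + " -o " + cubin_filename).c_str());
    assert(ptxas_exit_status == 0);

    // The driver loads a cubin as-is, so no JIT compilation happens here.
    CUmodule cuModule;
    CUresult result = cuModuleLoad(&cuModule, cubin_filename.c_str());
    assert(result == CUDA_SUCCESS);
    return cuModule;
}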


Answer

In his comment, Roger Dahl has linked the following post:

Passing the PTX program to the CUDA driver directly

in which the use of two functions, namely cuModuleLoad and cuModuleLoadDataEx, is addressed. The former is used to load PTX code from a file and hand it to the driver's just-in-time compiler. The latter avoids the I/O and makes it possible to pass the PTX code to the driver as a C string. In either case, you need to already have the PTX code at your disposal, either as the result of compiling a CUDA kernel (to be loaded from file or copied and pasted into the C string) or as a hand-written source.
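A minimal sketch of the in-memory path, assuming ptx_source already holds a complete PTX program (for instance produced by nvcc -ptx) and that the module exports a function named kernel, as the code later in this answer does:

#include <cassert>
#include <string>
#include <cuda.h>

// Sketch: assumes cuInit/cuCtxCreate have already been called.
CUfunction load_kernel_from_ptx_string(const std::string& ptx_source) {
    // cuModuleLoadDataEx takes the PTX image directly from memory; the three
    // trailing arguments could carry JIT options (optimization level, etc.).
    CUmodule cuModule;
    CUresult result = cuModuleLoadDataEx(&cuModule, ptx_source.c_str(),
                                         0, NULL, NULL);
    assert(result == CUDA_SUCCESS);

    // Retrieve the entry point by its (extern "C", hence unmangled) name.
    CUfunction cuFunction;
    result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
    assert(result == CUDA_SUCCESS);
    return cuFunction;
}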

But what happens if you have to create the PTX code on the fly, starting from a CUDA kernel? Following the approach in CUDA Expression Templates, you can define a string containing your CUDA kernel, like

ss << "extern \"C\" __global__ void kernel( ";
ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
ss << "\tfor(unsigned int i = 0; i < ";
ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
ss << "\t\tif(idx < vector_size) { \n";
ss << "\t\t\t" << eval_line.str() << "\n";
ss << "\t\t\tidx += number_of_used_threads;\n";
ss << "\t\t}\n";
ss << "\t}\n";
ss << "}\n\n\n\n";

and then use a system call to compile it, as in

    // NVCC and NVCC_FLAGS are assumed to be defined elsewhere (the path to
    // the nvcc binary and any additional compilation flags, respectively).
    int nvcc_exit_status = system(
        (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename
             + " -o " + kernel_comp_filename).c_str());

    if (nvcc_exit_status) {
        std::cerr << "ERROR: nvcc exits with status code: " << nvcc_exit_status << std::endl;
        exit(1);
    }

and finally use cuModuleLoad and cuModuleGetFunction to load the PTX code from the file and hand it to the driver's just-in-time compiler, like

    result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
    assert(result == CUDA_SUCCESS);
    result =  cuModuleGetFunction(&cuFunction, cuModule, "kernel");
    assert(result == CUDA_SUCCESS);
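
From there, the obtained cuFunction is launched through the driver API. A sketch of that last step; the launch configuration and the argument list are assumptions and must match the signature generated into the kernel string above:

    // Hypothetical launch parameters; d_data stands for whatever device
    // pointers def_line declares in the generated signature.
    void* args[] = { &d_data, &vector_size, &number_of_used_threads };
    result = cuLaunchKernel(cuFunction,
                            blocks_per_grid, 1, 1,    // grid dimensions
                            threads_per_block, 1, 1,  // block dimensions
                            0, NULL,                  // shared memory bytes, stream
                            args, NULL);              // kernel arguments
    assert(result == CUDA_SUCCESS);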

Of course, expression templates have nothing to do with this problem; I'm only quoting the source of the ideas reported in this answer.

