CUDA内核具有函数指针和可变参数模板 [英] CUDA kernel with function pointer and variadic templates

查看:385
本文介绍了CUDA内核具有函数指针和可变参数模板的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想设计一个cuda框架,它将接受用户函数,并通过设备函数指针将它们转发到内核。 CUDA可以使用可变参数模板(-stc = c ++ 11),到目前为止都很好。



但是,当内核调用设备函数指针。显然内核运行没有问题,但GPU使用率为0%。如果我简单地用实际的函数替换回调指针,那么GPU使用率是99%。这里的代码非常简单,大循环范围只是为了使事情可测量。我测量了gpu状态:

  nvidia-smi --query-gpu = utilization.gpu,utilization.mory,memory。 used --format = csv -lms 100 -f out.txt 

IIRC,用户函数需要在与内核相同的文件单元(#included可能)为了nvcc成功。 func_d在源代码中,它编译和运行良好,除了不使用函数指针(这是设计中的整个点)。



我的问题是:
为什么内核与回调设备函数指针不工作?



注意,当我printf noth callback和func_d地址,它们是相同的,如在这个示例输出中:

  args = 1的大小
callback ()address = 4024b0
func_d()address = 4024b0

另一个奇怪的是,if一个取消注释 callback()调用 kernel()然后GPU使用率回到0% code> func_d()调用仍然在那里... func_d版本运行大约4秒,而回调版本没有什么(好,〜0.1秒)。



系统规格和编译命令在下面代码的头部。



谢谢!

  //编译时使用:
// nvcc -g -G -O0 -std = c ++ 11 -arch = sm_20 -x cu sample。 cpp
//
// Nvidia Quadro 6000(计算能力2.0)
// CUDA 6.5(V6.5.12),
// Arch Linux,Nvidia驱动程序343.22-4, gcc 4.9.1
// 2014年11月


#include< stdio.h>

__device__
void func_d(double * vol)
{
* vol + = 5.4321f;
}


// CUDA内核函数
template< typename ... Types>
__global__ void kernel(void(* callback)(Types * ...))
{
double val0 = 1.2345f;

// //不使用gpu(0%gpu利用率)
// for(int i = 0; i <1000000; i ++){
// callback (& val0);
//}

//使用gpu(99%gpu利用率)
for(int i = 0; i <10000000; i ++){
func_d & val0);
}
}


//主函数
模板< typename ... Types>
void host_func(void(* callback)(Types * ...))
{
//获取用户内核的参数个数。
constexpr int I = sizeof ...(Types);
printf(size of Args =%d \\\
,I);

printf(callback()address =%x\\\
,callback);
printf(func_d()address =%x\\\
,func_d);

dim3 nblocks = 100;
int nthread = 100;
kernel< Types ...><<< nblocks,nthread>>>(callback);
}


__host__
int main(int argc,char ** argv)
{
host_func(func_d);
}


解决方案


< >

我的问题是:为什么内核与回调设备函数指针不工作?




b $ b

可能有几个问题需要解决。但是最简单的答案是,在主机代码中获取设备实体的地址是非法的。这对于设备变量以及设备功能也是如此。现在,您可以 获取这些实体的地址。但是地址是垃圾。它不能在主机或设备上使用。如果您尝试使用它们,您将在设备上获得未定义的行为,这通常会使您的内核停止。



主机地址可能在主机码。可以在设备代码中观察设备地址。任何其他行为需要API干预。


  1. 您似乎使用 nvidia-smi 利用率查询作为一种衡量事物是否正确运行的度量。我建议您正确的cuda错误检查,而且您可能希望使用 cuda-memcheck 运行您的代码。


  2. 为什么 func_d 的地址与 callback ?的地址匹配?因为您在主机代码中使用两个地址,而且这两个地址都是垃圾。为了说服你自己,在你的内核的最后添加一行像这样:

      if((!threadIdx。 x)&&(!blockIdx.x))printf(in-kernel func_d()address =%x \\\
    ,func_d);

    ,您将看到它打印出的东西与正在打印的东西不同。 p>


  3. 「装置使用率如何?」一旦设备遇到错误,内核就会终止,并且利用率将变为零。希望这将解释这个声明为你:另一个奇怪的是,如果一个注释在kernel()callback()调用,然后GPU使用回到0%,即使func_d()调用仍然在那里...


  4. 如何解决此问题?我不知道一个伟大的方法来解决这个问题。如果您在编译时知道有限数量的CUDA函数,您希望用户能够从中进行选择,那么相应的事情可能是创建一个适当的索引,并使用它来选择函数。如果你真的想,你可以运行一个初始/设置内核,它将获取你所关心的函数的地址,然后你可以将这些地址传回主机代码,并在后续的内核调用中作为参数使用,应该允许你的机制工作。但我不知道它如何防止需要通过编译时已知的一组预定义函数进行索引。如果你向前的方向是你希望用户能够在运行时提供用户定义的函数,我认为你会发现这很难做 >与CUDA运行时API(我怀疑这可能会在将来更改。)我提供了一个相当扭曲的机制尝试这样做


  5. 以后,请缩进您的代码。


一个充分运作的例子,展示了上面的几个想法。特别是,我以相当粗糙的方式显示, func_d 地址可以在设备代码中获取,然后传递回主机,然后用作未来的内核参数成功选择/调用该设备功能。

  $ cat t595.cu 
//编译时使用:
// nvcc -g -G -O0 -std = c ++ 11 -arch = sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000(计算能力2.0)
// CUDA 6.5(V6.5.12),
// Arch Linux,Nvidia驱动程序343.22-4,gcc 4.9.1
// 2014年11月

$ b b #include< stdio.h>

__device__
void func_d(double * vol)
{
if((!threadIdx.x)&&(!blockIdx.x))printf value =%f \\\
,* vol);
* vol + = 5.4321f;
}

模板< typename ... Types>
__global__ void setup_kernel(void(** my_callback)(Types * ...)){
* my_callback = func_d;
}

// CUDA内核函数
template< typename ... Types>
__global__ void kernel(void(* callback)(Types * ...))
{
double val0 = 1.2345f;

// //不使用gpu(0%gpu利用率)
// for(int i = 0; i <1000000; i ++){
callback ; val0);
//}

val0 = 0.0f;
//使用gpu(99%gpu利用率)
// for(int i = 0; i <10000000; i ++){
func_d(& val0);
//}
if((!threadIdx.x)&&(!blockIdx.x))printf(in-kernel func_d()address =%x \\\
,func_d) ;
}


//主函数
模板< typename ... Types>
void host_func(void(* callback)(Types * ...))
{
//获取用户内核的参数个数。
constexpr int I = sizeof ...(Types);
printf(size of Args =%d \\\
,I);

printf(callback()address =%x\\\
,callback);
printf(func_d()address =%x\\\
,func_d);

dim3 nblocks = 100;
int nthread = 100;
unsigned long long * d_callback,h_callback;
cudaMalloc(& d_callback,sizeof(unsigned long long));
setup_kernel<<< 1,1>>>((void(**)(Types * ...))d_callback);
cudaMemcpy(& h_callback,d_callback,sizeof(unsigned long long),cudaMemcpyDeviceToHost);
kernel< Types ...><<< nblocks,nthread>>>((void(*)(Types * ...))h_callback);
cudaDeviceSynchronize();
}


__host__
int main(int argc,char ** argv)
{
host_func(func_d);
}
$ nvcc -std = c ++ 11 -arch = sm_20 -o t595 t595.cu
$ cuda-memcheck ./t595
======= == CUDA-MEMCHECK
Args = 1的大小
callback()address = 4025dd
func_d()address = 4025dd
value = 1.234500
value = 0.000000
in-kernel func_d()address = 4
=========错误摘要:0错误
$


I am trying to design a cuda framework which would accept user functions and forward them to the kernel, through device function pointers. CUDA can work with variadic templates (-stc=c++11) and so far so good.

However, I hit a problem when the kernel calls the device function pointer. Apparently the kernel runs with no problem, but the GPU usage is 0%. If I simply replace the callback pointer with the actual function then GPU usage is 99%. The code here is very simple and the large loop range is simply to make things measurable. I measured the gpu status with:

nvidia-smi --query-gpu=utilization.gpu,utilization.mory,memory.used --format=csv -lms 100 -f out.txt

IIRC, the user function needs to be in the same file unit as the kernel (#included perhaps) in order to nvcc succeed. The func_d is right there in the source and it compiles and runs fine, well besides not working with the function pointer (which is the whole point in this design).

My question is: Why the kernel with the callback device function pointer is not working?

Note that, when I printf noth the callback and func_d addresses, they are the same, as in this sample output:

size of Args = 1
callback() address = 4024b0
func_d()   address = 4024b0

Another weird thing is, if one uncomments the callback() call in kernel() then GPU usage is back to 0%, even with the func_d() call still in there... The func_d version takes about 4 seconds to run, whereas the callback version takes nothing (well, ~0.1sec).

System specs and compilation command are in the head of the code below.

Thanks!

// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014


#include <stdio.h>

__device__
void func_d(double* vol)
{
*vol += 5.4321f;
}


// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
double val0 = 1.2345f;

//  // does not use gpu (0% gpu utilization)
//  for ( int i = 0; i < 1000000; i++ ) {
//  callback( &val0 );
//  }

// uses gpu (99% gpu utilization)
for ( int i = 0; i < 10000000; i++ ) {
func_d( &val0 );
}
}


// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
// get user kernel number of arguments.
constexpr int I = sizeof...(Types);
printf("size of Args = %d\n",I);

printf("callback() address = %x\n",callback);
printf("func_d()   address = %x\n",func_d);

dim3 nblocks = 100;
int nthread = 100;
kernel<Types...><<<nblocks,nthread>>>( callback );
}


__host__
int main(int argc, char** argv)
{
host_func(func_d);
}

解决方案

My question is: Why the kernel with the callback device function pointer is not working?

There are probably several issues to address. But the simplest answer is because it is illegal to take the address of device entities in host code. This is true for device variables as well as device functions. Now, you can take the address of those entities. But the address is garbage. It is not usable either on the host or on the device. If you attempt to use them anyway, you'll get undefined behavior on the device, which will usually bring your kernel to a halt.

Host addresses may be observed in host code. Device addresses may be observed in device code. Any other behavior requires API intervention.

  1. You appear to be using the nvidia-smi utilization query as a measure of whether or not things are running correctly. I would suggest doing proper cuda error checking instead, and also you may wish to run your code with cuda-memcheck.

  2. "Why then does the address of func_d match the address of callback?" Because you are taking both addresses in host code, and both addresses are garbage. To convince yourself of this, add a line something like this at the very end of your kernel:

    if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d()   address = %x\n",func_d);
    

    and you will see that it prints out something different from what is being printed on the host.

  3. "What about the device utilization?" As soon as the device encounters an error, the kernel terminates, and utilization goes to zero. Hopefully this will explain this statement for you: "Another weird thing is, if one uncomments the callback() call in kernel() then GPU usage is back to 0%, even with the func_d() call still in there... "

  4. "How can I fix this?" I don't know of a great way to fix this. If you have a limited number of CUDA functions known at compile-time, that you want the user to be able to select from, then the appropriate thing is probably to just create an appropriate index, and use that to select the function. If you really want to, you can run a preliminary/setup kernel, which will take the address of functions you care about, and then you can pass these addresses back to host code, and use them in subsequent kernel calls as parameters, and this should allow your mechanism to work. But I don't see how it prevents the need to index through a set of pre-defined functions known at compile-time. If the direction you are headed in is that you want the user to be able to provide user-defined functions at runtime I think you will find this quite difficult to do at the moment with the CUDA runtime API (I suspect this is likely to change in the future.) I provided a rather contorted mechanism to try to do this here (read the whole question and answer; talonmies answer there is informative as well). If, on the other hand, you are willing to use the CUDA driver API, then it should be possible, although somewhat involved, since this is exactly what is done in a very elegant fashion in PyCUDA, for example.

  5. In the future, please indent your code.

Here's a fully worked example, demonstrating a few of the ideas above. In particular, I am showing in a rather crude fashion, that the func_d address can be taken in device code, then passed back to the host, then used as a future kernel parameter to successfully select/call that device function.

$ cat t595.cu
// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014


#include <stdio.h>

__device__
void func_d(double* vol)
{
  if ((!threadIdx.x) && (!blockIdx.x)) printf("value = %f\n", *vol);
  *vol += 5.4321f;
}

template <typename... Types>
__global__ void setup_kernel(void (**my_callback)(Types*...)){
  *my_callback = func_d;
}

// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
  double val0 = 1.2345f;

//  // does not use gpu (0% gpu utilization)
//  for ( int i = 0; i < 1000000; i++ ) {
  callback( &val0 );
//  }

  val0 = 0.0f;
// uses gpu (99% gpu utilization)
//  for ( int i = 0; i < 10000000; i++ ) {
    func_d( &val0 );
//  }
  if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d()   address = %x\n",func_d);
}


// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
// get user kernel number of arguments.
  constexpr int I = sizeof...(Types);
  printf("size of Args = %d\n",I);

  printf("callback() address = %x\n",callback);
  printf("func_d()   address = %x\n",func_d);

  dim3 nblocks = 100;
  int nthread = 100;
  unsigned long long *d_callback, h_callback;
  cudaMalloc(&d_callback, sizeof(unsigned long long));
  setup_kernel<<<1,1>>>((void (**)(Types*...))d_callback);
  cudaMemcpy(&h_callback, d_callback, sizeof(unsigned long long), cudaMemcpyDeviceToHost);
  kernel<Types...><<<nblocks,nthread>>>( (void (*)(Types*...))h_callback );
  cudaDeviceSynchronize();
}


__host__
int main(int argc, char** argv)
{
  host_func(func_d);
}
$ nvcc -std=c++11 -arch=sm_20 -o t595 t595.cu
$ cuda-memcheck ./t595
========= CUDA-MEMCHECK
size of Args = 1
callback() address = 4025dd
func_d()   address = 4025dd
value = 1.234500
value = 0.000000
in-kernel func_d()   address = 4
========= ERROR SUMMARY: 0 errors
$

这篇关于CUDA内核具有函数指针和可变参数模板的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆