CUDA __device__ function as class member: Inlining and performance?




I plan to partition my computation into a fine-grained framework of functions/classes, each of which encapsulates a certain part of the computation.

Something like this, but with even more classes and typically longer parameter lists:

class Point {

  Coordinates thisPoint;
  Value getPointValue();
  Point getPoint(Offset offset);
  int getNumNeighbors();   // returns a count, so int rather than Point
  Point getNeighbor(int i);
  // many more

};

class Operator {

  void doOperation(Point p) {
    // calls some of the functions in Point
  }

};

Clearly, this would be good practice in any object-oriented language. But it's intended to run on a CUDA GPU. What I don't know: When I qualify all these fine-grained functions as __device__ and call them in a kernel - how will they be implemented? Will I have significant overhead for the calls of the member functions, or will they be inlined or otherwise efficiently optimized? Normally, these functions are extremely short but called many, many times.

Solution

The GPU compiler will aggressively inline functions for performance reasons. In that case, there should be no particular impact on performance.

If a function cannot be inlined, then the usual performance overhead occurs: creating a stack frame and making a function call, just as you would observe on a CPU call to a non-inlined function.
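If inlining matters at a specific call site, CUDA also provides function qualifiers to nudge the compiler either way. A minimal sketch (the function names here are illustrative, not from the question):

```cuda
// __forceinline__ asks the compiler to inline even where it otherwise
// might not; __noinline__ forces a real call with stack-frame setup.
__device__ __forceinline__ float squared(float x) { return x * x; }
__device__ __noinline__    float cubed(float x)   { return x * x * x; }

__global__ void demo(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = squared(in[i]) + cubed(in[i]);  // the second call should
                                                 // appear as CAL in the SASS
}
```

These hints are requests, not guarantees in every situation, but they are the standard way to influence the decision when the default heuristics don't do what you want.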

If you have concerns about a specific example, you can create a short test case and look at the generated assembly language (SASS) with cuobjdump -sass myexe to determine whether or not the function was inlined.

There are no general restrictions on inlining of __device__ functions that are class members/methods.
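Putting this together for the classes in the question, a sketch of how the member functions might be qualified (the bodies are placeholders, since the question does not show them, and Coordinates is simplified to two floats):

```cuda
struct Coordinates { float x, y; };

class Point {
public:
    Coordinates thisPoint;

    // Short accessors like these are prime inlining candidates.
    __device__ float getPointValue() const { return thisPoint.x + thisPoint.y; }
    __device__ Point getNeighbor(int i) const {
        Point p = *this;
        p.thisPoint.x += (i == 0 ? 1.0f : -1.0f);  // placeholder logic
        return p;
    }
};

class Operator {
public:
    __device__ void doOperation(Point &p) const {
        p.thisPoint.x += p.getPointValue();  // expected to be inlined away
    }
};

__global__ void step(Point *pts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    Operator op;
    if (i < n) op.doOperation(pts[i]);
}
```

After compiling with nvcc, inspecting the binary with cuobjdump -sass should show no CAL/RET pairs around these calls if they were inlined into the kernel body.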

