Porting a program to CUDA - kernel inside another kernel?

Problem description

I am trying to parallelize a function that contains several procedures. The function goes:

void _myfunction(M1, M2) {
    for (a = 0; a < A; a++) {
        Amatrix = procedure1(M1); /* contains for loops */
        Bmatrix = procedure2(M1); /* contains for loops */

        ...
        for (z = 1; z < Z; z++) {
            /* calculations with Amatrix(z) to obtain AAmatrix */
            /* calculations with Bmatrix(z) to obtain BBmatrix */
            for (e = 1; e < E; e++) {
                /* calculations with AAmatrix(e) to obtain CCmatrix */
                /* calculations with BBmatrix(e) to obtain DDmatrix */
            }
        }
        for (q = 0; q < Q; q++) { /* calculations with CCmatrix(q) */ }
        for (m = 0; m < M; m++) { /* calculations with DDmatrix(m) */ }
    }
}

I have already ported the functions procedure1() and procedure2() to CUDA, and everything is going fine (each of these procedures has its own for loops). These procedures are separate because they are conceptually independent algorithms, unlike the rest of the code, which is more general.

Now I am trying to port the rest of the code to CUDA, but I am not sure what to do. I would like to keep the same structure for the entire function, if possible. My first thought was to turn the function _myfunction(arg1,arg2,..) into a kernel, but my problem is that there are already two kernel functions that are executed in order inside it. I have read somewhere that streams can be used, but again I am not sure how to do that or whether it is the right approach.

Question: Can somebody give a hint on how to port a program to CUDA?

P.S.: I am using a GeForce 9600GT (Compute Capability 1.1) and CUDA Toolkit 5.0.

Recommended answer

Keeping exactly the same structure might not be achievable in CUDA, because parts of the problem might not be parallelizable as written. That depends on the nature of the problem.

On your device you cannot launch a kernel from within another kernel. That mechanism is called Dynamic Parallelism and is very recent. To my knowledge it was introduced with the Kepler architecture and requires a device of compute capability 3.5 or higher, so Compute Capability 1.1 does not support it.

Summing up, you won't be able to achieve this with the same structure. But that doesn't mean you cannot achieve it at all. Here are my recommendations for porting your program, and any other:


  1. Read the CUDA C Programming Guide and the CUDA C Best Practices Guide (assuming you use CUDA C).
  2. Restructure/rethink the original problem and see if it can be parallelized.
  3. Perform a static analysis of your code (basically, read the code and use your programming knowledge to make things faster).
  4. Perform a dynamic analysis of your code. You can do this with tools. I would recommend Valgrind. It is widely used, it's free, it has many different modules that help you inspect different aspects of your program, and it's supported on many platforms. I have used it and I think it is good.
  5. After these two analyses, look for problematic points in your program, e.g. the parts that take most of the program's execution time.
  6. Try to parallelize those points. As I said, the structure doesn't have to be the same.
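
One way to restructure without launching a kernel from inside a kernel is to keep the outer loop on the host and run each stage as its own launch, in order. The following is only a plain C model of that control flow, not CUDA code: `launch`, `stage_scale`, `stage_offset` and `run_pipeline` are hypothetical names, the arithmetic is a placeholder, and the `for` loop inside `launch` stands in for what would be a grid of parallel threads on the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element stage body. In CUDA each stage would be a
   __global__ kernel and idx would be derived from blockIdx/threadIdx. */
typedef void (*stage_fn)(size_t idx, const float *in, float *out);

/* Placeholder stage: doubles each element. */
static void stage_scale(size_t idx, const float *in, float *out) {
    out[idx] = 2.0f * in[idx];
}

/* Placeholder stage: adds one to each element. */
static void stage_offset(size_t idx, const float *in, float *out) {
    out[idx] = in[idx] + 1.0f;
}

/* Stand-in for a kernel launch: runs the stage once per index.
   Here the indices run sequentially; on a GPU they would run as threads. */
static void launch(stage_fn stage, size_t n, const float *in, float *out) {
    assert(stage != NULL && in != NULL && out != NULL);
    for (size_t idx = 0; idx < n; idx++) {
        stage(idx, in, out);
    }
}

/* Host-side driver: the outer loop stays on the host and issues the
   stages one after another, so no stage ever launches another stage. */
void run_pipeline(size_t iterations, size_t n, float *data, float *scratch) {
    for (size_t a = 0; a < iterations; a++) {
        launch(stage_scale, n, data, scratch);   /* first stage: data -> scratch */
        launch(stage_offset, n, scratch, data);  /* second stage: scratch -> data */
    }
}
```

Calling `run_pipeline(2, 3, data, scratch)` on `{1, 2, 3}` applies both stages twice to every element. In a real port each `launch` call would become a kernel launch from host code, and the data would stay in device memory between stages.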

Note #1: As you're a newbie, the first two readings are mandatory; otherwise you will spend a lot of time debugging.
Note #2: If you don't find problematic points in your program, I highly doubt you could speed up your code with CUDA. But I would say that is an extreme case.
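
To check whether a program has problematic points at all, a rough first step is to time each section of the host code and see which one dominates. The sketch below assumes the standard C `clock()` function is good enough for a coarse measurement; `time_section`, `slowest_section` and `example_section` are hypothetical helpers for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A section of the program to be timed, taking an opaque context pointer. */
typedef void (*section_fn)(void *ctx);

/* Hypothetical example section: sums the first n integers, where ctx
   points to a long holding n. */
void example_section(void *ctx) {
    long n = *(const long *)ctx;
    volatile double acc = 0.0;  /* volatile so the loop is not optimized away */
    for (long i = 0; i < n; i++) {
        acc += (double)i;
    }
}

/* Average CPU time in seconds of one call to fn, measured over repeats calls. */
double time_section(section_fn fn, void *ctx, int repeats) {
    assert(fn != NULL && repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        fn(ctx);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}

/* Index of the section with the largest measured time. */
size_t slowest_section(const double *seconds, size_t count) {
    assert(seconds != NULL && count > 0);
    size_t worst = 0;
    for (size_t i = 1; i < count; i++) {
        if (seconds[i] > seconds[worst]) {
            worst = i;
        }
    }
    return worst;
}
```

Timing each loop of the original function this way gives an array of durations, and `slowest_section` points at the first candidate for parallelization. This measures host CPU time only. Kernel launches return before the kernel finishes, so timing GPU work needs a synchronization before the clock is read, or CUDA events instead.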
