cuda: warp divergence overhead vs extra arithmetic


Problem Description



Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs.

But what is the overhead of warp divergence (scheduling only some of the threads to execute certain lines) vs. additional useless arithmetic?

Consider the following dummy example:

version 1:

__device__ int get_D (int A, int B, int C)
{
    //The value A is potentially different for every thread.

    int D = 0;

    if (A < 10)
        D = A*6;
    else if (A < 17)
        D = A*6 + B*2;
    else if (A < 26)
        D = A*6 + B*2 + C; 
    else 
        D = A*6 + B*2 + C*3;

    return D;
}

vs.

version 2:

__device__ int get_D (int A, int B, int C)
{
    //The value A is potentially different for every thread.

    return  A*6 + (A >= 10)*(B*2) + (A < 26)*C + (A >= 26)*(C*3);
}

My real scenario is more complicated (more conditions), but the idea is the same.

Questions:

Is the overhead (in scheduling) of warp divergence so great that version 1 is slower than version 2?

Version 2 requires many more ALUs than version 1, and most of these are wasted on "multiplication by 0" (only a select few of the conditionals evaluate to 1 rather than 0). Does this tie up valuable ALUs in useless operations, delaying instructions in other warps?

Solution

Concrete answers to questions like these are usually difficult to provide. There are many factors which influence the comparison analysis between the 2 cases:

  • You say A is potentially different for each thread, but the extent to which this is true will actually influence the comparison.
  • Overall, whether your code is compute bound or bandwidth bound certainly influences the answer. (If your code is bandwidth bound, there may be no performance difference between the two cases).
  • I know you've identified A, B, C, as integers, but a seemingly innocuous change like making them float could influence the answer significantly.

Fortunately there are profiling tools that can help give crisp, specific answers (or perhaps indicate that there isn't much difference between the two cases). You've done a pretty good job of identifying the 2 specific cases you care about. Why not benchmark the 2? And if you want to dig deeper, the profiling tools can give statistics about instruction replay (which comes about due to warp divergence), bandwidth/compute-bound metrics, etc.
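A minimal timing harness along these lines is enough to get started. The sketch below is an illustration under stated assumptions, not code from the question or the answer: the kernel names, problem size, launch configuration, and the values passed for B and C are made up, and the all-zeros fill for A should be replaced with something representative of your real data, since the divergence cost is data-dependent.

#include <cstdio>
#include <cuda_runtime.h>

__device__ int get_D_v1(int A, int B, int C)
{
    // version 1: branching form from the question
    int D = 0;
    if (A < 10)      D = A*6;
    else if (A < 17) D = A*6 + B*2;
    else if (A < 26) D = A*6 + B*2 + C;
    else             D = A*6 + B*2 + C*3;
    return D;
}

__device__ int get_D_v2(int A, int B, int C)
{
    // version 2: arithmetic/predicated form, as written in the question
    return A*6 + (A >= 10)*(B*2) + (A < 26)*C + (A >= 26)*(C*3);
}

__global__ void kernel_v1(const int *A, int *out, int B, int C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = get_D_v1(A[i], B, C);
}

__global__ void kernel_v2(const int *A, int *out, int B, int C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = get_D_v2(A[i], B, C);
}

int main()
{
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(int);

    int *A, *out;
    cudaMalloc((void **)&A, bytes);
    cudaMalloc((void **)&out, bytes);

    // All-zero input means every thread of version 1 takes the same branch,
    // i.e. no divergence at all. Fill A to match your real data distribution.
    cudaMemset(A, 0, bytes);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Warm-up launches so one-time setup cost is not included in the timings.
    kernel_v1<<<grid, block>>>(A, out, 3, 5, n);
    kernel_v2<<<grid, block>>>(A, out, 3, 5, n);

    cudaEventRecord(start);
    kernel_v1<<<grid, block>>>(A, out, 3, 5, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("version 1: %.3f ms\n", ms);

    cudaEventRecord(start);
    kernel_v2<<<grid, block>>>(A, out, 3, 5, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("version 2: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A);
    cudaFree(out);
    return 0;
}

The same binary can then be run under the profiler (for example, nvprof exposes metrics such as branch_efficiency along with instruction-replay statistics) to see how much divergence actually occurs with your data.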

I have to take exception to this blanket statement:

Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs.

That's simply not true. The ability of the machine to handle divergent control flow is in fact a feature which allows us to program it in friendlier languages like C/C++, and in fact differentiates it from some other acceleration technologies that don't offer the programmer this flexibility.

Like any other optimization effort, you should focus your attention at the heavy lifting first. Does this code you've provided constitute the bulk of the work done by your application? It doesn't make sense, in most cases, to put this level of analytical effort into something that is basically glue code or not part of the main work of your app.

And if it is the bulk of the effort of your code, then the profiling tools are really a powerful way to get good meaningful answers that are likely to be more useful than trying to do an academic analysis.

Now for my stab at your questions:

Is the overhead (in scheduling) of warp divergence so great that version 1 is slower than version 2?

This will depend on the specific level of branching that actually occurs. In the worst case, with completely independent paths for 32 threads, the machine will completely serialize and you are in effect running at 1/32 the peak performance. A binary-decision-tree type subdivision of the threads cannot yield this worst case, but certainly can approach it by the end of the tree. It might be possible to observe more than a 50% slowdown on this code, possibly 80% or higher slowdown, due to complete thread divergence at the end. But it will depend statistically on how often the divergence actually occurs (i.e. it's data-dependent). In the worst case, I would expect version 2 to be faster.

Version 2 requires many more ALUs than version 1, and most of these are wasted on "multiplication by 0" (only a select few of the conditionals evaluate to 1 rather than 0). Does this tie up valuable ALUs in useless operations, delaying instructions in other warps?

float vs. int might actually help here, and might be something you could consider exploring. But the second case appears (to me) to have all the same comparisons as the first case, plus a few extra multiplies. In the float case, the machine can do one multiply per thread per clock, so it's pretty fast. In the int case, it's slower, and you can see the specific instruction throughputs depending on architecture here. I wouldn't be overly concerned about that level of arithmetic. And again, it may make no difference at all if your app is memory bandwidth bound.
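If you do explore the float route, the question's version 2 translates directly. The snippet below is only a hypothetical sketch of that idea (the function name is made up), relying on each comparison result converting to 0.0f or 1.0f before the multiply:

// Hypothetical float variant of the question's version 2: each comparison
// yields a bool that converts to 0.0f or 1.0f, so the whole expression
// becomes a handful of multiplies and adds with no branching.
__device__ float get_D_f(float A, float B, float C)
{
    return A*6.0f
         + (A >= 10.0f) * (B*2.0f)
         + (A <  26.0f) * C
         + (A >= 26.0f) * (C*3.0f);
}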

Another way to tease all this out would be to write kernels around the two code variants of interest, compile to ptx (nvcc -ptx ...), and compare the ptx instructions. This gives a much better idea of what the machine thread code will look like in each case, and if you just do something like an instruction count, you may find that there's not much difference between the two cases (which, in that case, should favor option 2).
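To do that comparison, the device function has to be referenced from some __global__ kernel, or no device code is emitted for it. A small wrapper per variant is enough; the sketch below shows a file for version 2 (file names, the kernel name, and the exact nvcc invocation are illustrative):

// get_d_v2.cu -- wrap the variant in a trivial kernel so nvcc emits PTX
// for it; do the same in a second file for version 1.
__device__ int get_D(int A, int B, int C)
{
    return A*6 + (A >= 10)*(B*2) + (A < 26)*C + (A >= 26)*(C*3);
}

__global__ void k(const int *A, int *out, int B, int C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = get_D(A[i], B, C);
}

// Then, from the command line:
//   nvcc -ptx get_d_v1.cu -o get_d_v1.ptx
//   nvcc -ptx get_d_v2.cu -o get_d_v2.ptx
// Diff the two .ptx files: look at the raw instruction counts and at whether
// the compiler used predication/selects (setp, selp) or actual branches (bra).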
