Branch divergence, CUDA and Kinetic Monte Carlo


Problem description

So, I have some code that uses Kinetic Monte Carlo on a lattice in order to simulate something. I am using CUDA to run this code on my GPU (although I believe the same question applies to OpenCL as well).

This means that I divide my lattice into little sub-lattices, and each thread operates on one of them. Since I am doing KMC, each thread runs code like this:

    while (condition == true) {
        /* Grab a sample u from U[0,1] */
        for (i = 0; i < 100; i++) {
            /* Do some stuff here to generate A */
            if (A > u) {
                /* Do more stuff here, which could include updates to global memory */
                break;
            }
        }
    }

A is different for different threads, and so is u; the 100 is just an arbitrary number. In the real code it could be 1000 or even 10000.
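
For concreteness, here is a minimal sketch of how a per-thread loop like this might look as a CUDA kernel. This is not my actual code: compute_A and apply_event are hypothetical placeholders for the problem-specific KMC steps, and I am assuming curand supplies the U[0,1] samples (with the states initialised elsewhere via curand_init).

    #include <curand_kernel.h>

    /* Hypothetical placeholder: generate the acceptance value A for this thread. */
    __device__ float compute_A(const float *lattice, int tid, int i)
    {
        return lattice[tid] * 0.01f * (float)(i + 1);
    }

    /* Hypothetical placeholder: apply the accepted event, touching global memory. */
    __device__ void apply_event(float *lattice, int tid, int i)
    {
        lattice[tid] += (float)i;
    }

    __global__ void kmc_kernel(curandState *states, float *lattice, int n_steps)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState local = states[tid];               /* initialised beforehand with curand_init */

        for (int step = 0; step < n_steps; ++step) {   /* plays the role of while(condition)      */
            float u = curand_uniform(&local);          /* sample u from U[0,1]                    */
            for (int i = 0; i < 100; ++i) {
                float A = compute_A(lattice, tid, i);  /* "do some stuff to generate A"           */
                if (A > u) {                           /* the potentially divergent branch        */
                    apply_event(lattice, tid, i);      /* may update global memory                */
                    break;
                }
            }
        }
        states[tid] = local;
    }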

So, won't we have branch divergence when the time comes for a thread to pass through that if? How badly can this affect performance? I know that the answer depends on the code inside the if-clause, but how will this scale as I add more and more threads?

Any reference on how I can estimate losses/gains in performance would also be welcome.

Thanks!

Answer

The GPU runs threads in groups of 32, called warps. Divergence can only happen within a warp. So, if you can arrange your threads in such a way that the if condition evaluates the same way across the entire warp, there is no divergence.
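
As an illustration (a sketch added here, not code from the question), one common way to get that is to make the decision once per warp, for example by letting lane 0 decide and broadcasting the result with __shfl_sync. Here per_warp_value and threshold are hypothetical, and __shfl_sync assumes a GPU recent enough to support warp shuffles.

    __global__ void warp_uniform_branch(const float *per_warp_value, float *out, float threshold)
    {
        const unsigned full_mask = 0xffffffffu;
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = tid >> 5;                             /* global warp index          */
        int lane = tid & 31;                             /* lane within the warp       */

        int take = 0;
        if (lane == 0)
            take = per_warp_value[warp] > threshold;     /* one decision per warp      */
        take = __shfl_sync(full_mask, take, 0);          /* broadcast lane 0's result  */

        if (take)                                        /* warp-uniform: no divergence */
            out[tid] *= 2.0f;
    }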

When there is divergence in an if, conceptually, the GPU simply ignores the results and memory requests from threads in which the if condition was false.

So, say that the if evaluates to true for 10 of the threads in a particular warp. While inside that if, the potential compute performance of the warp is reduced from 100% to 10/32 × 100 ≈ 31%, as the 22 threads that were disabled by the if could have been doing work but are now just taking up room in the warp.

Once exiting the if, the disabled threads are enabled again, and the warp proceeds with a 100% potential compute performance.

An if-else behaves in much the same way. When the warp gets to the else, the threads that were enabled in the if become disabled, and the ones that were disabled become enabled.
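
A tiny illustrative kernel (my own sketch, not from the question) shows this: the per-lane condition below splits the warp, so the two bodies execute one after the other, each with the complementary set of lanes masked off.

    __global__ void diverging_branch(float *x)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid & 1)
            x[tid] = x[tid] * 2.0f;   /* odd lanes run while even lanes are masked off   */
        else
            x[tid] = x[tid] + 1.0f;   /* then even lanes run while odd lanes are masked  */
    }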

In a for loop that loops a different number of times for each thread in the warp, threads are disabled as their iteration counts reach their set numbers, but the warp as a whole must keep looping until the thread with the highest iteration count is done.
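
For example (again just an illustrative sketch), in a kernel like the following, lanes whose count[tid] is small finish early and sit idle, but the warp keeps issuing iterations until the lane with the largest count is done:

    __global__ void ragged_loop(const int *count, float *x)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float v = x[tid];
        for (int i = 0; i < count[tid]; ++i)   /* trip count differs per lane             */
            v = 0.5f * v + 1.0f;               /* finished lanes idle until the slowest   */
        x[tid] = v;                            /* lane in the warp completes its loop     */
    }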

When looking at potential memory throughput, things are a little bit more complicated. If an algorithm is memory bound, there might not be much or any performance lost due to warp divergence, because the number of memory transactions may be reduced. If each thread in the warp was reading from an entirely different location in global memory (a bad situation for a GPU), time would be saved for each of the disabled threads, as their memory transactions would not have to be performed. On the other hand, if the threads were reading from an array that had been optimized for access by the GPU, multiple threads would be sharing the results from a single transaction. In that case, the values that were meant for the disabled threads are read from memory and then discarded, together with the computations the disabled threads could have done.
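
To make the two memory cases concrete, here is an illustrative sketch (mine, not from the question; stride is a hypothetical parameter, and the caller is assumed to size in accordingly). The first read is coalesced, so one transaction serves the whole warp and values fetched for disabled lanes are simply discarded; the second read is scattered, so each active lane costs its own transaction and disabled lanes at least save theirs.

    __global__ void memory_patterns(const float *in, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        float a = in[tid];            /* coalesced: consecutive lanes, consecutive addresses */
        float b = in[tid * stride];   /* scattered: each lane touches a distant address      */

        out[tid] = a + b;
    }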

So, now you probably have enough of an overview to make pretty good judgement calls about how much warp divergence is going to affect your performance. The worst case is when only a single thread in a warp is active: then you get 1/32 = 3.125% of the potential compute-bound performance. The best case is 31/32 = 96.875%. For an if that is fully random, you get about 50%. And, as mentioned, memory-bound performance depends on the change in the number of required memory transactions.
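
If you want a back-of-the-envelope number, here is one way to estimate it (my own sketch, under the assumption that each of the 32 lanes takes the branch independently with probability p): the warp pays for the if-body whenever at least one lane takes it, so the expected fraction of useful lanes inside the body is roughly p / (1 - (1 - p)^32), which comes out at about 50% for p = 0.5, matching the figure above.

    #include <math.h>
    #include <stdio.h>

    /* Estimated fraction of useful lanes inside a divergent if-body, assuming each
       of the 32 lanes takes the branch independently with probability p. The warp
       executes the body whenever at least one of its lanes takes the branch. */
    double branch_utilization(double p)
    {
        double warp_executes_body = 1.0 - pow(1.0 - p, 32);   /* P(at least one lane active)     */
        if (warp_executes_body <= 0.0)
            return 1.0;                                        /* the body never runs             */
        return p / warp_executes_body;                         /* conditional active lanes / 32   */
    }

    int main(void)
    {
        for (double p = 0.1; p <= 0.91; p += 0.2)
            printf("p = %.1f -> ~%.0f%% of the warp useful inside the if\n",
                   p, 100.0 * branch_utilization(p));
        return 0;
    }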
