How does instruction level parallelism and thread level parallelism work on GPUs?


Problem Description



Let's say I'm trying to do a simple reduction over an array of size n, say one kept within a single work unit... say adding all the elements. The general strategy seems to be to spawn a number of work items on each GPU, which reduce items in a tree. Naively this would seem to take log n steps, but it's not as if the first wave of threads all go in one shot, is it? They get scheduled in warps.

for (int offset = get_local_size(0) / 2;
     offset > 0;
     offset >>= 1) {
    if (local_index < offset) {
        float other = scratch[local_index + offset];
        float mine = scratch[local_index];
        scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

So 32 items get added in parallel, and then that thread waits at the barrier. Another 32 go and we wait at the barrier. Another 32 go and we wait at the barrier, until all the threads have done the n/2 additions needed at the topmost level of the tree, and then we go around the loop. Cool.

This seems good, but perhaps complicated? I understand instruction level parallelism is a big deal, so why not spawn ONE thread and do something like

int i = 0;
while (i < array_size) {
    scratch[0] += scratch[i+16];
    scratch[1] += scratch[i+17];
    scratch[2] += scratch[i+18];
    ...
    i += 16;
}
...
int accum = 0;
accum += scratch[0];
accum += scratch[1];
accum += scratch[2];
accum += scratch[3];
...

such that all the adds happen within a warp. Now you have ONE thread keeping the GPU as busy as you like.

Now assume instruction level parallelism isn't really a thing. What about the following, with the work size set to 32 (the warp size)?

for (int i = 32; i < array_size; i += 32) {
    scratch[get_local_id(0)] += scratch[i + get_local_id(0)];
}

and then add the first 32 items together. I imagine that those 32 threads would keep firing again and again.

If you're not averse to giving up the generality of OpenCL, why bother reducing in a tree when you KNOW how many adds will fire per cycle?

Solution

One thread cannot keep the GPU busy. That's roughly the same as saying one thread can keep an 8-core CPU busy.

In order to get maximum utilization of the compute resources as well as the available memory bandwidth, it's necessary to utilize the entire machine (i.e. all available resources that can execute threads).

With most newer GPUs, you can certainly get improved performance through instruction level parallelism, by having your thread code have multiple independent instructions in sequence. But you can't throw all that into a single thread and expect it to give good performance.

When you have 2 instructions in sequence, like this:

scratch[0] += scratch[i+16];
scratch[1] += scratch[i+17];

That is good for ILP because those two operations are completely independent of each other. But, because of the way GPUs issue memory transactions, the first line of code will take part in a particular memory transaction, and the second line of code will necessarily take part in a different memory transaction.

When we have a warp working together, a line of code like this:

float other = scratch[local_index + offset];

will cause all members of the warp to generate a request, but those requests will all be combined into a single or perhaps two memory transactions. That is how you can achieve full bandwidth utilization.

Although most modern GPUs have caches, and the caches will tend to bridge the gap somewhat between these two approaches, they will by no means make up for the large disparity in transactions between having all warp members issue a combined request, vs. a single warp member issuing a set of requests in sequence.

You may want to read up on GPU memory coalescing. Since your question appears to be OpenCL-centric, you may be interested in this document.
