Rationalizing what is going on in my simple OpenCL kernel in regards to global memory


Problem Description



const char programSource[] =
        "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
        "{"
        "    int gid = get_global_id(0);"
        "for(int i=0; i<10; i++){"
        "    a[gid] = b[gid] + c[gid];}"
        "}";

The kernel above is a vector addition done ten times in a loop. I have used the programming guide and Stack Overflow to figure out how global memory works, but I still can't tell by looking at my code whether I am accessing global memory in a good way. I am accessing it in a contiguous fashion, and I am guessing in an aligned way. Does the card load 128-byte chunks of global memory for arrays a, b, and c? Does it then load the 128-byte chunk for each array once for every 32 gid indexes processed? (4*32 = 128) It seems then that I am not wasting any global memory bandwidth, right?

BTW, the compute profiler shows a gld and gst efficiency of 1.00003, which seems weird; I thought it would be exactly 1.0 if all my stores and loads were coalesced. How is it above 1.0?

Solution

Yes, your memory access pattern is pretty much optimal. Each half-warp is accessing 16 consecutive 32-bit words. Furthermore, the access is 64-byte aligned, since the buffers themselves are aligned and the start index for each half-warp is a multiple of 16. So each half-warp will generate one 64-byte transaction, and you shouldn't be wasting memory bandwidth through uncoalesced accesses.
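
To make the address arithmetic concrete, here is a small, self-contained C sketch (an illustration of my own, not part of the original question or answer) that prints the byte range touched by the first few half-warps for the coalesced pattern a[gid], assuming 4-byte ints and a buffer that starts at a 128-byte-aligned address (offset 0):

    /* Hypothetical illustration: byte ranges per half-warp for a[gid]. */
    #include <stdio.h>

    int main(void)
    {
        for (int hw = 0; hw < 4; ++hw) {            /* first four half-warps       */
            int first_gid  = hw * 16;               /* 16 work-items per half-warp */
            int last_gid   = first_gid + 15;
            int first_byte = first_gid * 4;         /* sizeof(int) == 4            */
            int last_byte  = last_gid * 4 + 3;
            /* 64 consecutive, 64-byte-aligned bytes -> one 64B transaction
               on compute 1.x hardware */
            printf("half-warp %d: gid %2d-%2d, bytes %3d-%3d\n",
                   hw, first_gid, last_gid, first_byte, last_byte);
        }
        return 0;
    }

Each printed range is exactly 64 bytes starting at a multiple of 64, which is what allows the single 64-byte transaction per half-warp.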

Since you asked for examples in your last question, let's modify this code to use other, less optimal access patterns (since the loop doesn't really do anything, I will ignore it):

kernel void vecAdd(global int* a, global int* b, global int* c)
{
   int gid = get_global_id(0);
   a[gid+1] = b[gid * 2] + c[gid * 32];
}

At first let's see how this works on compute 1.3 (GT200) hardware.

For the writes to a this will generate a slightly suboptimal pattern (the table below lists the half-warps, identified by their gid range, together with the corresponding access pattern):

   gid  | addr. offset | accesses     | reasoning
  0- 15 |     4- 67    | 1x128B       | in aligned 128byte block
 16- 31 |    68-131    | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
 32- 47 |   132-195    | 1x128B       | in aligned 128byte block
 48- 63 |   196-259    | 1x64B, 1x32B | crosses 128B boundary, so no 128B access

So basically we are wasting about half our bandwidth (the less-than-doubled access width for the odd half-warps doesn't help much, because generating more accesses isn't any faster than wasting more bytes, so to speak).
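
As a rough sanity check on the "about half" figure, here is a tiny C sketch (my own illustration, with the transaction sizes taken from the table above) that compares useful bytes to transferred bytes:

    /* Hypothetical illustration: bandwidth utilisation of the a[gid+1] writes
       on compute 1.3, averaged over one even and one odd half-warp. */
    #include <stdio.h>

    int main(void)
    {
        int useful  = 16 * 4;     /* 16 work-items x 4 useful bytes each */
        int even_hw = 128;        /* 1x128B transaction                  */
        int odd_hw  = 64 + 32;    /* 1x64B + 1x32B transactions          */
        double avg  = (even_hw + odd_hw) / 2.0;
        printf("utilisation: %.0f%%\n", 100.0 * useful / avg);  /* ~57% */
        return 0;
    }

So on average roughly 57% of the transferred bytes are actually used, i.e. a bit less than half of the bandwidth is wasted.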

For the reads from b the threads access only even elements of the array, so for each half-warp all accesses lie in a 128-byte aligned block (the first element is at a 128B boundary, since for that element the gid is a multiple of 16, so the index is a multiple of 32, which for 4-byte elements means the address offset is a multiple of 128B). The access pattern stretches over the whole 128B block, so this will do a 128B transfer for every half-warp, again wasting half the bandwidth.

The reads from c generate one of the worst-case scenarios, where each thread indexes into its own 128B block, so each thread needs its own transfer, which on one hand is a bit of a serialization scenario (although not quite as bad as normally, since the hardware should be able to overlap the transfers). What's worse is the fact that this will transfer a 32B block for each thread, wasting 7/8 of the bandwidth (we access 4B/thread, and 32B/4B = 8, so only 1/8 of the bandwidth is utilized). Since this is the access pattern of naive matrix transposes, it is highly advisable to do those using local memory (speaking from experience).
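
Since the paragraph above recommends local memory for transpose-like patterns, here is a minimal sketch of a tiled OpenCL transpose kernel (my own illustration, not from the original answer; the kernel name, the TILE size and the float element type are assumptions, and the work-group size must be TILE x TILE):

    // Hypothetical sketch: stage a TILE x TILE block in local memory so that
    // both the global read and the global write are contiguous per half-warp.
    #define TILE 16

    __kernel void transpose(__global const float *in,
                            __global float *out,
                            int width, int height)
    {
        __local float tile[TILE][TILE + 1];   // +1 padding avoids local-memory bank conflicts

        int x = get_group_id(0) * TILE + get_local_id(0);
        int y = get_group_id(1) * TILE + get_local_id(1);
        if (x < width && y < height)
            tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];    // coalesced read

        barrier(CLK_LOCAL_MEM_FENCE);

        // swap the block coordinates so the write is contiguous as well
        x = get_group_id(1) * TILE + get_local_id(0);
        y = get_group_id(0) * TILE + get_local_id(1);
        if (x < height && y < width)
            out[y * height + x] = tile[get_local_id(0)][get_local_id(1)];  // coalesced write
    }

Both global accesses now walk consecutive addresses within each half-warp; only the local-memory indices are permuted, which is cheap.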

Compute 1.0 (G80)

Here the only pattern which creates good accesses is the original one; all patterns in the example will create completely uncoalesced accesses, wasting 7/8 of the bandwidth (32B transfer/thread, see above). For G80 hardware, every access where the nth thread in a half-warp doesn't access the nth element creates such uncoalesced accesses.
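
To illustrate the stricter G80 rule, consider a pattern that merely permutes neighbouring threads (an example of my own, not from the original answer): on compute 1.2/1.3 the half-warp still touches a single 64B segment and coalesces, but on compute 1.0/1.1 the kth thread no longer reads the kth word, so the read breaks up into 16 separate transactions:

    // Hypothetical example: fine on compute 1.2/1.3, fully uncoalesced on G80.
    __kernel void permuted_read(__global const int *b, __global int *out)
    {
        int gid = get_global_id(0);
        out[gid] = b[gid ^ 1];   // each pair of adjacent threads swaps elements
    }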

Compute 2.0 (Fermi)

Here every access to memory creates 128B transactions (as many as necessary to gather all the data, so 16x128B in the worst case); however, those are cached, making it less obvious where data will be transferred. For the moment let's assume the cache is big enough to hold all the data and there are no conflicts, so every 128B cache line will be transferred at most once. Let's furthermore assume serialized execution of the half-warps, so we have a deterministic cache occupation.

Accesses to b will still always transfer 128B blocks (no other thread indexes into the corresponding memory area). Accesses to c will generate 128B transfers per thread (the worst access pattern possible).

For the accesses to a it looks as follows (treating them like reads for the moment):

   gid  | offset  | accesses | reasoning
  0- 15 |   4- 67 |  1x128B  | bringing 128B block to cache
 16- 31 |  68-131 |  1x128B  | offsets 68-127 already in cache, bring 128B for 128-131 to cache
 32- 47 | 132-195 |    -     | block already in cache from  last halfwarp
 48- 63 | 196-259 |  1x128B  | offsets 196-255 already in cache, bringing in 256-383

So for large arrays the accesses to a will, in theory, waste almost no bandwidth. For this example the reality is of course not quite as good, since the accesses to c will thrash the cache pretty nicely.

As for the profiler, I would assume that the efficiencies over 1.0 are simply the result of floating-point inaccuracies.

Hope that helps
