Strange behaviour using local memory in OpenCL


Question

I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code that puzzles me (reduced for clarity and testing purposes):

kernel void TestKernel(global const int* groupOffsets, global       float* result,     
                       local        int* tmpData,             const int    itemcount)
{
   unsigned int groupid    = get_group_id(0);
   unsigned int globalsize = get_global_size(0);
   unsigned int groupcount = get_num_groups(0);

   for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
   {
      barrier(CLK_LOCAL_MEM_FENCE);
      if(get_local_id(0) == 0)
         tmpData[0] = groupOffsets[groupid]; 
      barrier(CLK_LOCAL_MEM_FENCE);
      int offset = tmpData[0];
      result[id]   = (float) offset;
   }
}

This code should load the offset for each work-group into local memory, then read it back and write it into the corresponding entry of the output vector. For most work-items this works, but in each work-group the work-items with local IDs 1 to 31 read an incorrect value. My output vector (for a work-group size of 128) is as follows:

index       0: 0
index   1- 31: 470400
index  32-127: 0
index     128: 640
index 129-159: 471040
index 160-255: 640
index     256: 1280
index 257-287: 471680
index 288-511: 1280
...

The output I expected is:

index   0-127: 0
index 128-255: 640
index 256-511: 1280
...

The strange thing is that the problem only occurs when I use fewer than itemcount work-items (it works as expected when globalsize >= itemcount, meaning that every work-item processes only one entry). So I'm guessing it has something to do with the loop. Does anyone know what I'm doing wrong and how to fix it?

Update: I found out that it seems to work if I change

if(get_local_id(0) == 0)
   tmpData[0] = groupOffsets[groupid];

to

if(get_local_id(0) < 32)
   tmpData[0] = groupOffsets[groupid];

This astonishes me even more. While it might fix the problem, I don't feel comfortable fixing it this way (it might break some other time). Besides, I would rather avoid losing performance on GeForce 8xxx class hardware due to the additional memory accesses (which, as far as I understand, are uncoalesced on that hardware). So the question still remains.

Recommended answer

Firstly, and importantly, you need to make sure that itemcount is a multiple of the local work size, to avoid divergence when executing the barrier. Otherwise, in the work-group that straddles itemcount, some work-items leave the loop while others in the same group go on to another barrier.

The OpenCL specification says of barrier: "All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel."
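The divergence condition can be checked on the host before launching anything. The following is a minimal sketch in plain C (not OpenCL, and not part of the original answer): it counts how many times the question's loop would run for each work-item and reports whether a whole work-group agrees on that count. The names trip_count and group_is_uniform are hypothetical helpers introduced only for this illustration, and the sketch assumes globalsize is a multiple of the local work size.

```c
#include <assert.h>
#include <stdbool.h>

/* Iterations of the question's loop for one work-item:
 *   for (id = global_id; id < itemcount; id += globalsize)
 * Hypothetical helper, host-side only. */
unsigned int trip_count(unsigned int global_id, unsigned int itemcount,
                        unsigned int globalsize)
{
    assert(globalsize > 0);
    if (global_id >= itemcount)
        return 0;
    /* ceil((itemcount - global_id) / globalsize), written without overflow */
    return (itemcount - global_id - 1) / globalsize + 1;
}

/* True when every work-item of the given group runs the loop the same number
 * of times, so every barrier inside the loop is reached by the whole group. */
bool group_is_uniform(unsigned int group, unsigned int local_size,
                      unsigned int itemcount, unsigned int globalsize)
{
    assert(local_size > 0);
    unsigned int first = trip_count(group * local_size, itemcount, globalsize);
    for (unsigned int lid = 1; lid < local_size; ++lid) {
        if (trip_count(group * local_size + lid, itemcount, globalsize) != first)
            return false;
    }
    return true;
}
```

With a local size of 128, a global size of 256 and itemcount = 1000 (example values chosen here, not taken from the question), group 1 is not uniform: its first work-item runs four iterations while its last one runs only three. With itemcount = 1024 the same group is uniform.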

You could implement this as follows:

unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
    // ...
    if (id < itemcount)
        result[id]   = (float) offset;
}
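To see what the rounded loop bound buys, the loop above can be walked on the host for a single work-item. This is a rough sketch in plain C under the same assumption that globalsize is a multiple of the local work size; round_up and simulate_item are hypothetical names used only here, and the id arithmetic is assumed not to overflow for the sizes involved.

```c
#include <assert.h>

/* Smallest multiple of local_size that is >= itemcount
 * (the same formula as itemcountrounded in the answer). */
unsigned int round_up(unsigned int itemcount, unsigned int local_size)
{
    assert(local_size > 0);
    return local_size * ((itemcount + local_size - 1) / local_size);
}

/* Walks the answer's loop for one work-item and reports how many iterations
 * (barrier passes) it makes and how many guarded writes it performs. */
void simulate_item(unsigned int global_id, unsigned int itemcount,
                   unsigned int local_size, unsigned int globalsize,
                   unsigned int *iterations, unsigned int *writes)
{
    assert(globalsize > 0);
    unsigned int rounded = round_up(itemcount, local_size);
    *iterations = 0;
    *writes = 0;
    for (unsigned int id = global_id; id < rounded; id += globalsize) {
        ++*iterations;        /* the barriers would execute here */
        if (id < itemcount)   /* guard from the answer: skip padding ids */
            ++*writes;
    }
}
```

For a local size of 128, a global size of 256 and itemcount = 1000 (illustrative values), the bound is rounded up to 1024. Work-items 128 and 255 then both make four iterations, so they meet at every barrier, but work-item 255 performs only three writes because its last id, 1023, is padding.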

You said the code was reduced for simplicity. What happens if you run exactly what you posted? I'm just wondering whether the barrier also needs to cover global memory (CLK_GLOBAL_MEM_FENCE).
