OpenCL: no synchronization despite barrier

Question

I just started using OpenCL from Python via the PyOpenCL interface. I tried to create a very simple "recurrent" program in which the result of each loop iteration in every work-item depends on the output of another work-item from the previous iteration, but I am running into synchronization problems:

__kernel void part1(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);

    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (i < 9)
    {
        for(int t = 0; t < 2; t++){
            c[i] = c[i+1] + a[i];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}

The host application is

import pyopencl as cl
from numpy import *

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

#read in the OpenCL source file as a string
f = open('recurrent.cl', 'r')
fstr = "".join(f.readlines())

#create the program
program = cl.Program(ctx, fstr).build()

mf = cl.mem_flags

#initialize client side (CPU) arrays
a = array(range(10), dtype=float32)

#create OpenCL buffers
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

#execute program
program.part1(queue, a.shape, None, a_buf, dest_buf)
c = empty_like(a)
#copy the result back to the host (enqueue_copy replaces the deprecated enqueue_read_buffer)
cl.enqueue_copy(queue, c, dest_buf).wait()

print("a", a)
print("c", c)

The result is

a [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
c [  0.   1.   5.   3.   4.  18.  13.   7.   8.   0.]

As you can see, some of the resulting values are correct. For example, the third position is 5 = 3 + 2, but the second position is 1 = 0 + 1. So the summation used the results of other threads from different points in time, despite the barrier being in place. I thought the barrier would make sure that all threads have reached it and have written their results to global memory?

It's probably something very simple and I would appreciate any hints and comments!

PS: I am running this on a Sandy Bridge CPU using the Intel SDK.
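
For comparison, here is a minimal host-side sketch of the values a fully synchronized run would be expected to produce. It assumes the intended semantics are "lockstep": in every iteration all threads first read the previous contents of c and only then write. The function name reference_recurrence is made up for this illustration and is not part of PyOpenCL.

```python
def reference_recurrence(a, steps=2):
    """Lockstep reference: each step reads the old c everywhere, then writes."""
    n = len(a)
    c = [0.0] * n                      # mirrors `c[i] = 0` in the kernel
    for _ in range(steps):             # mirrors the t-loop
        previous = list(c)             # snapshot, so nobody sees a half-updated c
        for i in range(n - 1):         # only i < n - 1 updates, as in `if (i < 9)`
            c[i] = previous[i + 1] + a[i]
    return c

a = [float(x) for x in range(10)]
expected = reference_recurrence(a)
print(expected)
```

Under that assumption the reference for a = 0..9 is 1, 3, 5, 7, 9, 11, 13, 15, 8, 0, so only positions 2, 6, 8 and 9 of the device output above agree with it.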

Recommended answer

I think I have the answer now. The OpenCL code itself was not the main problem. However, a barrier only synchronizes threads that are in the same workgroup. That was not the case here, which is easy to check by reading out the local id using get_local_id(0) (as suggested by Huseyin). In my case the host created a separate workgroup for every thread instead of putting all threads into one workgroup. Performance-wise that makes sense; compare the question "Questions about global and local work size".

In our case, however, we need to make sure the data is synchronized between the threads, so all of them should be in one workgroup. To this end we need to change how the kernel part1 is launched:

program.part1(queue, a.shape, None, a_buf, dest_buf)

The second argument is the global_size of the job (the number of threads created), whereas the third is the local_size, i.e. the number of threads per workgroup. Thus, this line should read

program.part1(queue, a.shape, a.shape, a_buf, dest_buf)
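
To make the effect of the two sizes concrete, here is a small pure-Python sketch of how a one-dimensional launch is split into workgroups. It assumes a zero global offset and the OpenCL 1.x rule that global_size must be a multiple of local_size. The helper work_item_layout is hypothetical and only mimics what get_group_id(0) and get_local_id(0) would report; when local_size is None the implementation picks the split itself, so the first case only reproduces what was observed on this machine.

```python
def work_item_layout(global_size, local_size):
    """Return (global_id, group_id, local_id) for a 1-D launch with zero offset."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    return [(gid, gid // local_size, gid % local_size)
            for gid in range(global_size)]

# One thread per workgroup, as observed with local_size=None on this setup
one_per_group = work_item_layout(10, 1)
# All threads in one workgroup, as requested by passing a.shape twice
single_group = work_item_layout(10, 10)

print(len({group for _, group, _ in one_per_group}))  # number of workgroups
print(len({group for _, group, _ in single_group}))   # number of workgroups
```

In the first layout every thread has local id 0 and its own group id, so a barrier has nobody to wait for; in the second all ten threads share group 0.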

Passing a.shape as the local size creates a single workgroup containing all threads (but keep an eye on the maximum number of threads allowed in one workgroup!). Now, the code still doesn't work. The last problem concerns the barriers in the OpenCL code: the last thread, with id = 9, fails the test i < 9 and so never reaches the barrier inside the loop, while all other threads are waiting for it to hit that barrier (though I wonder why that doesn't throw an exception?). So we simply need to reduce the total number of threads (to get rid of the last one):

program.part1(queue, (a.shape[0]-1,), (a.shape[0]-1,), a_buf, dest_buf)

That works! Learned some lessons in the process...
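
The barrier mismatch (ten threads in the workgroup, but only nine reaching the barrier inside the loop) can be mimicked on the host with Python's threading.Barrier. This is only an analogy, not a model of the OpenCL runtime: the OpenCL specification requires that a barrier inside a conditional is reached by all threads of the workgroup or by none, and leaves anything else undefined, which is presumably why no exception is raised. The helper run_workers and the one-second timeout are made up for this sketch.

```python
import threading

def run_workers(parties, arrivals, timeout=1.0):
    """Start `arrivals` threads that wait on a barrier sized for `parties`."""
    barrier = threading.Barrier(parties)
    outcomes = []
    lock = threading.Lock()

    def worker():
        try:
            barrier.wait(timeout=timeout)   # blocks until `parties` threads arrive
            result = "passed"
        except threading.BrokenBarrierError:
            result = "stuck"                # the missing thread never showed up
        with lock:
            outcomes.append(result)

    threads = [threading.Thread(target=worker) for _ in range(arrivals)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes

# Barrier sized for 10, but only 9 threads reach it (the original launch)
mismatched = run_workers(parties=10, arrivals=9)
# Barrier sized for 9 and 9 threads reach it (the reduced launch)
matched = run_workers(parties=9, arrivals=9)
print(mismatched.count("stuck"), matched.count("passed"))
```

The host barrier only reports the problem because it was given a timeout; without one the nine threads would simply wait forever.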

Thanks again to Huseyin! blue2script
