clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

Problem Description

On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on the Nvidia hardware (3 Tesla M2090s) we access remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then too, but it's a bit hazy.

This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRangeKernel using std::async/std::future from the new C++11 standard, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.

Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:

std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices  = queues.size();

/* Calculation of increment (local is gotten from the function parameters)*/

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{   
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);

    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
                   std::launch::async,
                   &cl::CommandQueue::enqueueNDRangeKernel,
                   &queues[i],
                   kernels[kernel],
                   offset,
                   increment,
                   local,
                   (const std::vector<cl::Event>*)NULL,
                   (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}   
//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{   
    cl_int execError = enqueueReturns[i].get();

    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}   
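
The increment calculation itself is only hinted at in the comment above; purely as an illustrative sketch (not the code I actually use), the "divide by the number of devices, round up to a multiple of the local size" step for one dimension could look like:

//Illustration only: split a 1-D global size across numDevices devices and
//round each device's share up to a multiple of the local work size
size_t perDeviceGlobal(size_t globalSize, size_t localSize, size_t numDevices)
{
    size_t share = (globalSize + numDevices - 1) / numDevices;  //ceil(global / devices)
    return ((share + localSize - 1) / localSize) * localSize;   //round up to multiple of local
}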

The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB, and see it get hit the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put a print statement after the call, I can see that some time passes between the prints.

P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.

Forgot to mention - the buffer that I'm passing to the kernel as an argument (the one I mention above that seems to get passed between the GPUs) is created with CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE before that, with the same effect.

Answer

I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK showing that, for each device, you need to create separate:

  • queues - so you can represent each device and enqueue work to it independently
  • buffers - one buffer for each array you need to pass to the device; otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
  • kernels - I think this one's optional, but it makes specifying arguments a lot easier.
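
To make that concrete, a minimal sketch of the per-device setup using the same C++ bindings as the question; the context, program, kernel name ("myKernel"), and the hostData/dataSizeInBytes variables are placeholders, not taken from the original code:

std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
std::vector<cl::CommandQueue> queues;
std::vector<cl::Buffer> buffers;
std::vector<cl::Kernel> kernels;

for(size_t i = 0; i < devices.size(); i++)
{
    //One command queue per device, so each device can be fed independently
    queues.push_back(cl::CommandQueue(context, devices[i]));

    //One buffer per device, so the devices don't serialize on a single shared buffer
    buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 dataSizeInBytes, hostData));

    //One kernel object per device, so setArg calls from different host threads don't clash
    kernels.push_back(cl::Kernel(program, "myKernel"));
    kernels.back().setArg(0, buffers.back());
}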

Furthermore, you have to call enqueueNDRangeKernel for each queue from a separate thread. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
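
A minimal sketch of that threaded launch (std::thread from <thread> here rather than the question's std::async; the per-device offsets vector, along with queues, kernels, increment, and local, is assumed to have been set up beforehand):

//Assumes queues, kernels, offsets (one cl::NDRange per device), increment and
//local were all prepared beforehand, as discussed above
std::vector<std::thread> workers;
for(size_t i = 0; i < queues.size(); i++)
{
    workers.push_back(std::thread([&, i]()
    {
        //Each blocking enqueue runs on its own host thread...
        queues[i].enqueueNDRangeKernel(kernels[i], offsets[i], increment, local);
        //...and waits in that thread, not on the main thread
        queues[i].finish();
    }));
}
//Join once every device has been handed its share of the work
for(auto &w : workers)
    w.join();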

After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
