clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

Problem Description

On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on the Nvidia hardware (3 Tesla M2090s) we access remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then too, but it's a bit hazy.

This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRangeKernel using std::async/std::future from the new C++11 standard, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.

Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:

std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices  = queues.size();

/* Calculation of increment (local is gotten from the function parameters)*/

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{   
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);

    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
                   std::launch::async,
                   &cl::CommandQueue::enqueueNDRangeKernel,
                   &queues[i],
                   kernels[kernel],
                   offset,
                   increment,
                   local,
                   (const std::vector<cl::Event>*)NULL,
                   (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}   
//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{   
    cl_int execError = enqueueReturns[i].get();

    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}   
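
The increment calculation itself is only hinted at in the comment above; purely as an illustrative sketch (not the code I actually use), the "divide by the number of devices, round up to a multiple of the local size" step for one dimension could look like:

//Illustration only: split a 1-D global size across numDevices devices and
//round each device's share up to a multiple of the local work size
size_t perDeviceGlobal(size_t globalSize, size_t localSize, size_t numDevices)
{
    size_t share = (globalSize + numDevices - 1) / numDevices;  //ceil(global / devices)
    return ((share + localSize - 1) / localSize) * localSize;   //round up to multiple of local
}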

The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB, and see it get hit the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put a print statement after the call, I can see that some time passes between the prints.

P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.

Forgot to mention - the buffer that I'm passing to the kernel as an argument (the one I mention above that seems to get passed between the GPUs) is created with CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE before that, with the same effect.

Answer

I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK showing that, for each device, you need to create separate:

  • queues - so you can represent each device and enqueue work to it independently
  • buffers - one buffer for each array you need to pass to the device; otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
  • kernels - I think this one's optional, but it makes specifying arguments a lot easier.
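
To make that concrete, a minimal sketch of the per-device setup using the same C++ bindings as the question; the context, program, kernel name ("myKernel"), and the hostData/dataSizeInBytes variables are placeholders, not taken from the original code:

std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
std::vector<cl::CommandQueue> queues;
std::vector<cl::Buffer> buffers;
std::vector<cl::Kernel> kernels;

for(size_t i = 0; i < devices.size(); i++)
{
    //One command queue per device, so each device can be fed independently
    queues.push_back(cl::CommandQueue(context, devices[i]));

    //One buffer per device, so the devices don't serialize on a single shared buffer
    buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 dataSizeInBytes, hostData));

    //One kernel object per device, so setArg calls from different host threads don't clash
    kernels.push_back(cl::Kernel(program, "myKernel"));
    kernels.back().setArg(0, buffers.back());
}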

Furthermore, you have to call enqueueNDRangeKernel for each queue from a separate thread. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
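
A minimal sketch of that threaded launch (std::thread from <thread> here rather than the question's std::async; the per-device offsets vector, along with queues, kernels, increment, and local, is assumed to have been set up beforehand):

//Assumes queues, kernels, offsets (one cl::NDRange per device), increment and
//local were all prepared beforehand, as discussed above
std::vector<std::thread> workers;
for(size_t i = 0; i < queues.size(); i++)
{
    workers.push_back(std::thread([&, i]()
    {
        //Each blocking enqueue runs on its own host thread...
        queues[i].enqueueNDRangeKernel(kernels[i], offsets[i], increment, local);
        //...and waits in that thread, not on the main thread
        queues[i].finish();
    }));
}
//Join once every device has been handed its share of the work
for(auto &w : workers)
    w.join();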

After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
