Concurrency in CUDA multi-GPU executions


Problem Description



I'm running a CUDA kernel function on a multi-GPU system with 4 GPUs. I expected the kernels to launch concurrently, but they do not. I measured the start time of each kernel, and the second kernel starts only after the first one finishes executing. As a result, launching the kernel on 4 GPUs is no faster than on a single GPU.

How can I make them work concurrently?

This is my code:

// On each device: select it, launch the kernel, then queue an async copy of the result
cudaSetDevice(0);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_0, parameterA +(0*rateA), parameterB + (0*rateB));
cudaMemcpyAsync(h_result_0, d_result_0, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(1);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_1, parameterA +(1*rateA), parameterB + (1*rateB));
cudaMemcpyAsync(h_result_1, d_result_1, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(2);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_2, parameterA +(2*rateA), parameterB + (2*rateB));
cudaMemcpyAsync(h_result_2, d_result_2, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(3);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_3, parameterA +(3*rateA), parameterB + (3*rateB));
cudaMemcpyAsync(h_result_3, d_result_3, mem_size_result, cudaMemcpyDeviceToHost);

Solution

You might need to use cudaMemcpyAsync. cudaMemcpy is a blocking call: it does not return control to your code until the copy finishes, so your code never switches to the next GPU until it has completed the routine for the current one.
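One caveat worth adding (standard CUDA behavior, though not mentioned above): cudaMemcpyAsync is only asynchronous with respect to the host when the host buffer is page-locked (pinned); with ordinary pageable memory the call can degrade to a synchronous copy. A minimal sketch of the allocation, reusing the question's h_result_0 and mem_size_result names:

// Allocate the host result buffer as pinned memory so that cudaMemcpyAsync
// can return immediately and overlap with work queued on the other GPUs.
float *h_result_0 = NULL;
cudaMallocHost((void **)&h_result_0, mem_size_result);

// ... kernel launch and cudaMemcpyAsync exactly as in the question ...

cudaFreeHost(h_result_0);  // pinned memory is released with cudaFreeHost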

However, kernel calls are asynchronous (with respect to the CPU), so the code you posted might look like it could cause race conditions (cudaMemcpy starting before the kernel finishes). As @talonmies pointed out in the comments, since cudaMemcpy/cudaMemcpyAsync goes into the same stream as the kernel launch, everything is executed in the right order.

I would recommend using CUDA streams; here is a brief introduction to multi-GPU programming using streams. It's not strictly necessary in your case, but streams can be very convenient in more complex applications, e.g. if you need to synchronize function calls between different devices.
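To make that concrete, below is a minimal self-contained sketch of the one-stream-per-device pattern. It is my own illustration rather than the poster's code; NUM_GPUS, dummy_kernel, and N are placeholder names:

#include <cstdio>
#include <cuda_runtime.h>

#define NUM_GPUS 4
#define N (1 << 20)

__global__ void dummy_kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = 2.0f * i;
}

int main()
{
    cudaStream_t streams[NUM_GPUS];
    float *d_result[NUM_GPUS];
    float *h_result[NUM_GPUS];

    // Per-device setup: a stream, a device buffer, and a pinned host buffer.
    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);
        cudaStreamCreate(&streams[dev]);
        cudaMalloc(&d_result[dev], N * sizeof(float));
        cudaMallocHost(&h_result[dev], N * sizeof(float));
    }

    // Issue all work back-to-back. Nothing in this loop blocks the host,
    // so all four GPUs receive their kernels almost simultaneously.
    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);
        dummy_kernel<<<(N + 255) / 256, 256, 0, streams[dev]>>>(d_result[dev]);
        cudaMemcpyAsync(h_result[dev], d_result[dev], N * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[dev]);
    }

    // The host waits only here, after every device has been given its work.
    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(streams[dev]);
    }

    printf("h_result[0][1] = %f\n", h_result[0][1]);

    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);
        cudaFree(d_result[dev]);
        cudaFreeHost(h_result[dev]);
        cudaStreamDestroy(streams[dev]);
    }
    return 0;
}

The key point is that the launch loop never blocks: kernel launches and cudaMemcpyAsync on pinned memory return immediately, and all host-side waiting is deferred to the final cudaStreamSynchronize loop.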
