CUDA streams not overlapping


Problem description

I have code very similar to this:

int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

'h_ptr_in' and 'h_ptr_out' are arrays of pointers allocated with cudaMallocHost (with no flags).

The problem is that the streams do not overlap. In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.

I may not have the resources to run two kernels concurrently (I think I do), but at least the kernel execution and the copies should overlap, right? And if I put all three operations (H2D copy, kernel execution, D2H copy) in the same for-loop, none of them overlap...

Please help: what could be causing this?

I'm running on:

Ubuntu 10.04 x64

Device: "GeForce GTX 460" (CUDA Driver Version: 3.20, CUDA Runtime Version: 3.20, CUDA Capability Major/Minor version number: 2.1, Concurrent copy and execution: Yes, Concurrent kernel execution: Yes)

Solution

According to this post on the NVIDIA forums, the profiler serializes stream execution in order to get accurate timing data, so the lack of overlap you see there may be an artifact of profiling. If you think your timings are off, make sure you're measuring with CUDA events...
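As a rough way to check for overlap without the profiler, the sketch below times the same copy/kernel/copy sequence twice with CUDA events: once with everything issued into a single stream, and once with one stream per chunk. It is only an illustration under stated assumptions: the kernel body, the chunk size n and the launch configuration are made up here (the question does not show them), and error checking is left out for brevity. It also assumes the default (legacy) stream behaviour, where an event recorded in stream 0 waits for earlier work in the other streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's mykernel (its real body is not shown).
__global__ void mykernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int no_streams = 4;
    const int n = 1 << 20;                       // floats per chunk (assumed size)
    const size_t bytes = n * sizeof(float);
    const dim3 dimBlock(256), dimGrid((n + 255) / 256);

    cudaStream_t stream[no_streams];
    float *h_ptr_in[no_streams], *h_ptr_out[no_streams], *g_in, *g_out;
    for (int k = 0; k < no_streams; k++) {
        cudaStreamCreate(&stream[k]);
        cudaMallocHost(&h_ptr_in[k], bytes);     // pinned memory, needed for async copies
        cudaMallocHost(&h_ptr_out[k], bytes);
        for (int i = 0; i < n; i++) h_ptr_in[k][i] = 1.0f;
    }
    cudaMalloc(&g_in, bytes * no_streams);
    cudaMalloc(&g_out, bytes * no_streams);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // pass 0: all work goes into stream[0] (serialized baseline)
    // pass 1: one stream per chunk, issued in the same order as in the question
    for (int pass = 0; pass < 2; pass++) {
        cudaStream_t s[no_streams];
        for (int k = 0; k < no_streams; k++) s[k] = stream[pass ? k : 0];

        cudaEventRecord(start, 0);               // stream 0: assumes legacy default stream
        for (int k = 0; k < no_streams; k++)
            cudaMemcpyAsync(g_in + k * n, h_ptr_in[k], bytes,
                            cudaMemcpyHostToDevice, s[k]);
        for (int k = 0; k < no_streams; k++)
            mykernel<<<dimGrid, dimBlock, 0, s[k]>>>(g_in + k * n, g_out + k * n, n);
        for (int k = 0; k < no_streams; k++)
            cudaMemcpyAsync(h_ptr_out[k], g_out + k * n, bytes,
                            cudaMemcpyDeviceToHost, s[k]);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);              // block the host until all work is done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%s: %.3f ms\n", pass ? "one stream per chunk" : "single stream", ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int k = 0; k < no_streams; k++) {
        cudaFreeHost(h_ptr_in[k]);
        cudaFreeHost(h_ptr_out[k]);
        cudaStreamDestroy(stream[k]);
    }
    cudaFree(g_in);
    cudaFree(g_out);
    return 0;
}
```

If the multi-stream pass is not measurably faster than the single-stream pass when run outside the profiler, the work is probably not overlapping; if it is faster, the serialization seen in the profiler is likely a measurement artifact.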

I've been experimenting with streaming lately, and I found the "simpleMultiCopy" example from the SDK to be really helpful, particularly with the appropriate logic and synchronizations.
