CUDA device to host copy very slow

Question

I'm running Windows 7 64-bit, CUDA 4.2, Visual Studio 2010.

First, I run some code on CUDA, then download the data back to the host. I then do some processing and move the data back to the device. After that, the following device-to-host copy runs very fast, about 1 ms.

clock_t start, end;
int count = 1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

It takes ~1ms to finish.

Then I ran some other code on the GPU, mainly atomic operations. After that, copying the data from device to host takes a very long time, about 9 s.

__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

It takes ~9s.

I ran this multiple times, for example:

int i=0;
while (i<10)
{
clock_t start, end;
int count = 1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
i++;
}

The results are pretty much the same.
What could be the problem?

Thanks!

Recommended answer

The problem is one of timing, not of any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy but also the time for the prior kernel you launched to complete. If you change your code for timing the copy operation to something like this:

cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

You should find the transfer times are restored to their previous performance. So your real question isn't "why is thrust::copy slow", it is "why is my kernel slow". And based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialise kernel memory transactions".
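To see the two costs separately, one option is to time the kernel and the copy with CUDA events instead of clock(). The following is only a sketch under assumptions: the kernel contended_exchange is a hypothetical stand-in for the elided dosomething kernel (every thread calls atomicExch on the same address), and the helper names elapsed_ms and time_kernel_and_copy are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical stand-in for the elided kernel: every thread targets the
// same address with atomicExch, so the updates cannot proceed in parallel.
__global__ void contended_exchange(int *d_bPtr, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        atomicExch(d_bPtr, tid);
}

// Waits for `stop` to complete, then returns the milliseconds between events.
static float elapsed_ms(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

// Times the kernel and the device-to-host copy separately.
// Returns false if the CUDA runtime reported an error.
bool time_kernel_and_copy(int count, float *kernel_ms, float *copy_ms)
{
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(d_b.data());

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;

    // Kernel time: the launch returns immediately, but elapsed_ms blocks
    // on the stop event, so the kernel has finished before we continue.
    cudaEventRecord(start);
    contended_exchange<<<blocks, threads>>>(d_bPtr, count);
    cudaEventRecord(stop);
    *kernel_ms = elapsed_ms(start, stop);

    // Copy time: no kernel is pending any more, so this is the copy alone.
    cudaEventRecord(start);
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    cudaEventRecord(stop);
    *copy_ms = elapsed_ms(start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return cudaGetLastError() == cudaSuccess;
}

int main()
{
    float kernel_ms = 0.0f, copy_ms = 0.0f;
    if (!time_kernel_and_copy(1000000, &kernel_ms, &copy_ms)) {
        std::printf("CUDA error\n");
        return 1;
    }
    std::printf("kernel: %.3f ms, copy: %.3f ms\n", kernel_ms, copy_ms);
    return 0;
}
```

If the kernel figure dominates, the time is going into the kernel rather than the transfer, which matches the diagnosis above. The actual numbers depend on the GPU and on what the real kernel does, so none are quoted here.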
