CUDA Device To Device transfer expensive
Question
I have written some code that swaps the quadrants of a 2D matrix for FFT purposes. The matrix is stored in a flat array.
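// Assumes: data is a device pointer to an H x W row-major matrix of T;
// dcW and dcH are the column and row at which the matrix is split.
int leftover = W - dcW;
T *temp;
T *topHalf;
cudaMalloc((void **)&temp, dcW * sizeof(T));
// Swap the left and right parts of every row (three copies per row).
for (int i = 0; i < H; i++)
{
    cudaMemcpy(temp, &data[i*W], dcW*sizeof(T), cudaMemcpyDeviceToDevice);
    // Caution: source and destination ranges overlap when leftover > dcW.
    cudaMemcpy(&data[i*W], &data[i*W+dcW], leftover*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[i*W+leftover], temp, dcW*sizeof(T), cudaMemcpyDeviceToDevice);
}
// Swap the top and bottom parts of the matrix.
cudaMalloc((void **)&topHalf, dcH*W*sizeof(T));
leftover = H - dcH;
cudaMemcpy(topHalf, data, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
// Caution: source and destination ranges overlap when leftover > dcH.
cudaMemcpy(data, &data[dcH*W], leftover*W*sizeof(T), cudaMemcpyDeviceToDevice);
cudaMemcpy(&data[leftover*W], topHalf, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
// Release the scratch buffers.
cudaFree(temp);
cudaFree(topHalf);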
Note that this code takes device pointers and performs DeviceToDevice transfers.
Why does this run so slowly? Can it be optimized somehow? I timed it against the same operation on the host using regular memcpy, and the device version was about 2x slower.
Any ideas?
Recommended answer
I ended up writing a kernel to do the swaps. This was indeed faster than the Device to Device memcpy operations.
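The answer does not include its kernel, so what follows is only a minimal sketch of what such a kernel might look like, not the answerer's code. It assumes an out-of-place variant: it reads from one device buffer and writes to a second one, which avoids overlapping reads and writes at the cost of extra memory. The names shiftQuadrants and shiftOnDevice, the float element type in the wrapper, and the 16x16 block size are all hypothetical choices made for illustration. One launch replaces the 3*H + 3 separate cudaMemcpy calls in the question.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (an assumption, not taken from the answer).
// Each thread writes one destination element:
//   out[r][c] = in[(r + dcH) % H][(c + dcW) % W]
// which is the rearrangement the memcpy sequence in the question aims for.
template <typename T>
__global__ void shiftQuadrants(const T *in, T *out, int W, int H, int dcW, int dcH)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // destination column
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // destination row
    if (c >= W || r >= H) return;                   // guard partial blocks

    int srcC = (c + dcW) % W;  // source column, wrapped around the row
    int srcR = (r + dcH) % H;  // source row, wrapped around the matrix
    out[r * W + c] = in[srcR * W + srcC];
}

// Hypothetical host wrapper: copies the matrix to the device, runs the
// kernel once, and copies the shifted matrix back. Error checking is
// omitted to keep the sketch short.
std::vector<float> shiftOnDevice(const std::vector<float> &host,
                                 int W, int H, int dcW, int dcH)
{
    size_t bytes = host.size() * sizeof(float);
    float *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block size, not tuned
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    shiftQuadrants<float><<<grid, block>>>(dIn, dOut, W, H, dcW, dcH);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

As a small check of the index mapping: for a 2 x 3 matrix {0, 1, 2, 3, 4, 5} with dcW = 1 and dcH = 1, shiftOnDevice(in, 3, 2, 1, 1) should return {4, 5, 3, 1, 2, 0}, which matches what the row swap followed by the top/bottom swap in the question would produce by hand. In real use the data would already live on the device, so only the kernel launch would be needed.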