Why CUDA memory copy speed behaves like this, some constant driver overhead?
I always have a strange 0.04 ms overhead when working with memory in CUDA on my old GeForce 8800GT. I need to transfer ~1-2K to constant memory of my device, work with that data on it and get only one float value from the device.
My typical GPU computation code looks like this:
// allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    // Do some heavy CPU logic (~0.005 ms long)
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    my_kernel<<<128, 128>>>(output);
    // several other calls of different kernels
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with the returned value
}
To measure the speed of the GPU memory operations alone, I ran this code (all kernel calls commented out, a cudaDeviceSynchronize call added):
// allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    // Do some heavy CPU logic (~0.001 ms long)
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    // Do some logic with the returned value
}
I measured the execution time of the loop and got ~0.05 s (so, 0.05 ms per iteration). The strange thing is that when I try to do some more memory work (adding additional cudaMemcpyToSymbolAsync and cudaMemcpyAsync calls), each extra call adds less than 0.01 ms. This corresponds with this researcher's results: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html
He also got ~0.01 ms per transfer of a 1 KB block to the GPU. So where does that 0.04 ms (0.05 - 0.01) of overhead come from? Any ideas? Maybe I should try this code on a newer card?
It seems to me that after cudaDeviceSynchronize and the CPU code, my GeForce drops into some power-saving mode or something similar.
I recommend that you increase the number of threads you launch and batch the transfers:
// Use malloc() to allocate memory on the CPU.
// Change mem_size to the total memory to be transferred to the GPU.
cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);

dim3 dimBlock(128, 2);
dim3 dimGrid(64000, 1);
my_kernel<<<dimGrid, dimBlock>>>(output);
// several other calls of different kernels

// Change the size field to 1000*sizeof(FLOAT_T)
cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
// Do some logic with the returned value
If the code crashes (because of more threads or more GPU memory), use loops, but make them as few as possible.