Why does CUDA memory copy speed behave like this (some constant driver overhead)?


Problem description


I always have a strange 0.04 ms overhead when working with memory in CUDA on my old GeForce 8800GT. I need to transfer ~1-2 KB to my device's constant memory, work with that data on the device, and get only one float value back from it.

I have typical code for a GPU computation:

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.005 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    my_kernel<<<128, 128>>>(output);
    //several other calls of different kernels
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 
}
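
The allocation mentioned in the comment above could look roughly like this (a minimal sketch; the buffer names match the snippets, and the sizes are placeholders):

//Constant memory symbol on the device (the question mentions ~1-2 KB).
typedef float FLOAT_T;
__constant__ char const_dev_mem[2048];

//Host-side handles used by the loops above.
void*    pinned_host_mem = NULL;
FLOAT_T* output          = NULL;
FLOAT_T  host_output;
size_t   mem_size = 2048;

//In the setup code, before the loop:
cudaHostAlloc(&pinned_host_mem, mem_size, cudaHostAllocDefault);  //pinned host memory for async copies
cudaMalloc((void**)&output, sizeof(FLOAT_T));                     //global device memory for the result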

I decided to measure the speed of working with GPU memory using this code (I commented out all kernel calls and added a cudaDeviceSynchronize call):

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.001 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    // Do some logic with returned value 
}

I measured the execution time of the loop and got ~0.05 s (so 0.05 ms per iteration). The strange thing is that when I do some more memory work (adding extra cudaMemcpyToSymbolAsync and cudaMemcpyAsync calls), each additional call adds less than 0.01 ms. This matches this researcher's measurements: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html

He also measured about 0.01 ms per transfer of a 1 KB block to the GPU. So where does that 0.04 ms (0.05 - 0.01) of overhead come from? Any ideas? Maybe I should try this code on a newer card?
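
For reference, a minimal sketch of how the loop above could be timed with CUDA events (names reused from the snippets; a host-side timer around the whole loop would work as well):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 1000; i++)
{
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float total_ms = 0.0f;
cudaEventElapsedTime(&total_ms, start, stop);
//total_ms / 1000 is the average time per iteration in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);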

It seems to me that after cudaDeviceSynchronize and the CPU code, my GeForce goes into some power-saving mode or something like that.
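
One thing that could be tested is whether the way the runtime waits inside cudaDeviceSynchronize contributes to this fixed cost, and whether the card is ramping up from an idle state. A sketch of that experiment, assuming busy-waiting on the CPU is acceptable for the application:

//Must be called before any other CUDA runtime call that creates the context.
//cudaDeviceScheduleSpin makes the host spin-wait in cudaDeviceSynchronize
//instead of yielding the thread, which can reduce per-call latency.
cudaSetDeviceFlags(cudaDeviceScheduleSpin);

//A few untimed warm-up iterations, so any clock ramp-up after idling
//is excluded from the measured loop.
for (int i = 0; i < 10; i++)
{
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
}
//...then run the timed loop from above.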

Solution

I recommend that you increase the number of threads you are launching:

    //Use malloc() to allocate memory on the CPU.
    //Change mem_size to the total amount of memory to be transferred to the GPU.
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    dim3 dimBlock(128,2);
    dim3 dimGrid(64000,1);
    my_kernel<<<dimGrid, dimBlock>>>(output);
    //several other calls of different kernels
    //change size field to 1000*sizeof(FLOAT_T)
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 

If the code crashes (because of too many threads or too much GPU memory), use loops, but keep them as few as possible.
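
The answer does not show how my_kernel would index its work in such a batched launch; a hypothetical sketch, assuming each block handles one of the original 1000 iterations and writes one result into an enlarged output array:

//Hypothetical batched kernel: blockIdx.x selects which of the former loop
//iterations this block computes; one thread per block writes the result.
__global__ void my_kernel(FLOAT_T* output)
{
    int iteration = blockIdx.x;
    FLOAT_T result = 0;
    //... compute result for this iteration from const_dev_mem ...
    if (threadIdx.x == 0 && threadIdx.y == 0)
        output[iteration] = result;
}

On the host side, output and host_output would then need room for 1000 FLOAT_T values instead of one, and the per-iteration CPU logic would have to be moved out of the single launch, which is not always possible.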
