Efficiency of the malloc function in CUDA
Question
I am trying to port some CPU code to CUDA. My CUDA card is based on the Fermi architecture, so I can use the malloc() function in device code to dynamically allocate memory, which means I don't need to change the original code much. (The malloc() function is called many times in my code.) My question is whether this malloc function is efficient enough, or whether we should avoid using it where possible. I don't get much speedup running my code on CUDA, and I suspect this is caused by the use of malloc().
Please let me know if you have any suggestions or comments. I appreciate your help.
Answer
The current device malloc implementation is very slow (there have been papers published about efficient CUDA dynamic memory allocation, but that work has not yet appeared in a released toolkit, AFAIK). The memory it allocates comes from the device heap, which is stored in global memory, so access to it is also slow. Unless you have a very compelling reason to do so, I would recommend avoiding in-kernel dynamic memory allocation. It will have a negative effect on overall performance. Whether it actually has much effect on your code is a completely separate question.
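As an illustration of the alternative, the usual pattern is to allocate one large buffer with cudaMalloc() on the host before launching the kernel, and let each thread index into its own slice, instead of calling malloc() per thread on the device. The kernel and variable names below are hypothetical; this is a minimal sketch, assuming each thread needs a fixed-size scratch array:

```cuda
#include <cuda_runtime.h>

// Slow pattern: every thread allocates from the device heap.
__global__ void with_device_malloc(float *out, int n_per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Device-heap allocation; serializes badly under contention and
    // can fail if the heap (see cudaLimitMallocHeapSize) runs out.
    float *scratch = (float *)malloc(n_per_thread * sizeof(float));
    if (scratch == NULL) return;
    for (int i = 0; i < n_per_thread; ++i)
        scratch[i] = (float)(tid + i);
    out[tid] = scratch[n_per_thread - 1];
    free(scratch);
}

// Faster pattern: one host-side cudaMalloc, each thread uses its own slice.
__global__ void with_preallocated(float *scratch_all, float *out, int n_per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *scratch = scratch_all + (size_t)tid * n_per_thread; // this thread's slice
    for (int i = 0; i < n_per_thread; ++i)
        scratch[i] = (float)(tid + i);
    out[tid] = scratch[n_per_thread - 1];
}

int main()
{
    const int threads = 256, blocks = 64, n_per_thread = 32;
    const int total = threads * blocks;

    float *scratch_all, *out;
    cudaMalloc(&scratch_all, (size_t)total * n_per_thread * sizeof(float));
    cudaMalloc(&out, total * sizeof(float));

    with_preallocated<<<blocks, threads>>>(scratch_all, out, n_per_thread);
    cudaDeviceSynchronize();

    cudaFree(scratch_all);
    cudaFree(out);
    return 0;
}
```

This trades some memory (every thread gets a full slice whether it needs one or not) for removing all allocator traffic from the kernel, which is usually the right trade when the per-thread size is bounded and known up front.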