Efficiency of Malloc function in CUDA


Question

I am trying to port some CPU code to CUDA. My CUDA card is based on the Fermi architecture, so I can use the malloc() function inside device code to allocate memory dynamically and do not need to change the original code much. (malloc() is called many times in my code.) My question is whether this malloc function is efficient enough, or whether it should be avoided where possible. I am not getting much speedup running my code on CUDA, and I suspect this is caused by the use of malloc().

Please let me know if you have any suggestions or comments. I appreciate your help.

Answer

The current device-side malloc implementation is very slow (papers have been published on efficient dynamic memory allocation in CUDA, but that work has not yet appeared in a released toolkit, as far as I know). The memory it allocates comes from the device heap, which resides in global memory, so accessing it is also slow. Unless you have a very compelling reason to do so, I would recommend avoiding in-kernel dynamic memory allocation. It will have a negative effect on overall performance. Whether it actually has much effect on your code is a completely separate question.
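To make the advice concrete, here is a minimal sketch (not from the original post; kernel names, buffer sizes, and the 64 MB heap limit are illustrative assumptions) contrasting per-thread in-kernel malloc() with a single up-front cudaMalloc() from the host, which is the usual way to keep the allocator out of kernels:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pattern the answer recommends avoiding: every thread allocates its own
// scratch buffer from the device heap, which lives in slow global memory.
__global__ void kernel_device_malloc(float *out, int n, int scratch_elems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float *scratch = (float *)malloc(scratch_elems * sizeof(float));
    if (scratch == NULL) { out[i] = -1.0f; return; }  // device heap exhausted

    for (int k = 0; k < scratch_elems; ++k)
        scratch[k] = (float)(i + k);
    out[i] = scratch[scratch_elems - 1];

    free(scratch);
}

// Alternative: the host allocates one large slab up front and each thread
// indexes into its own slice, so no allocator runs inside the kernel.
__global__ void kernel_preallocated(float *scratch_all, float *out,
                                    int n, int scratch_elems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float *scratch = scratch_all + (size_t)i * scratch_elems;
    for (int k = 0; k < scratch_elems; ++k)
        scratch[k] = (float)(i + k);
    out[i] = scratch[scratch_elems - 1];
}

int main()
{
    const int n = 1 << 16, scratch_elems = 32;
    const int block = 256, grid = (n + block - 1) / block;

    float *d_out, *d_scratch;
    cudaMalloc(&d_out, n * sizeof(float));

    // The device heap defaults to 8 MB; raise it (illustrative 64 MB here)
    // before launching a kernel in which many threads call malloc().
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64ull << 20);
    kernel_device_malloc<<<grid, block>>>(d_out, n, scratch_elems);
    cudaDeviceSynchronize();

    // Pre-allocated variant: one host-side cudaMalloc replaces n device mallocs.
    cudaMalloc(&d_scratch, (size_t)n * scratch_elems * sizeof(float));
    kernel_preallocated<<<grid, block>>>(d_scratch, d_out, n, scratch_elems);
    cudaDeviceSynchronize();

    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_scratch);
    cudaFree(d_out);
    return 0;
}
```

The point of the second kernel is that pre-allocating one slab and letting each thread compute its own offset removes the allocator from the hot path entirely; the scratch memory is still in global memory, but there is no per-thread allocation cost or heap-size limit to manage inside the kernel.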
