Global Memory vs. Dynamic Global Memory Allocation in CUDA

Question

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU), and I am placing these in global memory as well.

For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. In my application, however, there are far more reads from global memory than writes to it.

My question is whether there is any advantage to using global memory declared at global (variable) scope over using dynamically allocated global memory. The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. However, I know the upper limit on my global memory use and I am more concerned with performance, so I could also declare the memory statically, using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?

Recommended Answer

Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
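As a rough illustration of the three allocation routes, here is a minimal sketch. The names static_buf, heap_buf, runtime_buf, touch_all and run_allocation_demo are made up for this example, and it assumes a reasonably recent CUDA toolkit and a device that supports in-kernel malloc (compute capability 2.0 or later). Error handling is kept deliberately thin.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// 1. Static allocation: the size is fixed at compile time.
__device__ float static_buf[256];

// Single-threaded kernel that touches all three kinds of buffer.
__global__ void touch_all(float *runtime_buf, int n, int *ok)
{
    // 2. Dynamic allocation from device code (comes from the device heap).
    float *heap_buf = static_cast<float *>(malloc(n * sizeof(float)));
    if (heap_buf == nullptr) {
        *ok = 0;
        return;
    }
    for (int i = 0; i < n; ++i) {
        static_buf[i] = 1.0f;
        heap_buf[i] = 2.0f;
        runtime_buf[i] = static_buf[i] + heap_buf[i];
    }
    free(heap_buf);
    *ok = 1;
}

// Hypothetical helper: returns 0 if every step succeeded.
int run_allocation_demo()
{
    const int n = 256;
    float *runtime_buf = nullptr;
    int *d_ok = nullptr;

    // 3. Allocation from the host through the CUDA runtime API.
    if (cudaMalloc(&runtime_buf, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_ok, sizeof(int)) != cudaSuccess) {
        cudaFree(runtime_buf);
        return -1;
    }

    touch_all<<<1, 1>>>(runtime_buf, n, d_ok);

    int ok = 0;
    float first = 0.0f;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&ok, d_ok, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&first, runtime_buf, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_ok);
    cudaFree(runtime_buf);
    return (ok == 1 && first == 3.0f) ? 0 : 1;
}

int main()
{
    int rc = run_allocation_demo();
    printf("allocation demo %s\n", rc == 0 ? "succeeded" : "failed");
    return rc;
}
```

Inside the kernel all three buffers are dereferenced the same way; only the way they come into existence differs.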

All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).

Since dynamic allocations take some non-zero time, there may be a performance improvement for your code from doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
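For the loop described in the question, allocating once up front might look like the sketch below. The kernel step, the helper run_loop and the buffer names are hypothetical, and the real reduction is replaced by a trivial accumulation just to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread reads the same address (params[0]), as in the question.
__global__ void step(const float *params, float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        result[i] += params[0];
    }
}

// Hypothetical helper: returns 0 if the accumulated value is as expected.
int run_loop(int iterations)
{
    const int n = 1024;
    float *d_params = nullptr;
    float *d_result = nullptr;

    // Allocate once, before the performance-sensitive loop.
    if (cudaMalloc(&d_params, sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_result, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_params);
        return -1;
    }

    float h_param = 1.0f;
    cudaMemcpy(d_params, &h_param, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, n * sizeof(float));

    // The loop only launches kernels; no allocation happens inside it.
    for (int it = 0; it < iterations; ++it) {
        step<<<(n + 255) / 256, 256>>>(d_params, d_result, n);
    }

    float first = 0.0f;
    cudaMemcpy(&first, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_result);
    cudaFree(d_params);
    return (first == static_cast<float>(iterations)) ? 0 : 1;
}

int main()
{
    int rc = run_loop(10);
    printf("loop demo %s\n", rc == 0 ? "succeeded" : "failed");
    return rc;
}
```

If the upper bound on the required size is known, as the question states, the same buffers could be sized for that bound so that they never need to be reallocated between iterations.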

Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
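The host-side difference could be sketched as follows. The names d_static, add_static and run_host_access_demo are invented for this example. The device-heap case is described only in a comment, since such memory cannot be copied with the host-side runtime copy functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Statically allocated global memory.
__device__ float d_static[4];

// Reads the static array, writes to both the runtime-allocated buffer
// and back into the static array.
__global__ void add_static(float *out, int n)
{
    int i = threadIdx.x;
    if (i < n) {
        out[i] = d_static[i] + 1.0f;
        d_static[i] = out[i];
    }
}

// Hypothetical helper: returns 0 if both host read-backs match.
int run_host_access_demo()
{
    const int n = 4;
    float h_in[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_out[n] = {0.0f};
    float h_static[n] = {0.0f};
    float *d_out = nullptr;

    // Static memory: the host goes through the symbol-based copy functions.
    cudaMemcpyToSymbol(d_static, h_in, sizeof(h_in));

    // Runtime API memory: ordinary cudaMalloc / cudaMemcpy.
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;

    add_static<<<1, n>>>(d_out, n);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(h_static, d_static, sizeof(h_static));

    // Memory obtained with device-side malloc/new is not shown here: the host
    // cannot copy to or from it directly, so a kernel would first have to move
    // the data into a buffer such as d_out.

    cudaFree(d_out);
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i] + 1.0f || h_static[i] != h_out[i]) return 1;
    }
    return 0;
}

int main()
{
    int rc = run_host_access_demo();
    printf("host access demo %s\n", rc == 0 ? "succeeded" : "failed");
    return rc;
}
```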
