CUDA: Is coalesced global memory access faster than shared memory? Also, does allocating a large shared memory array slow down the program?


Problem description

I'm not finding an improvement in speed with shared memory on an NVIDIA Tesla M2050 with about 49K shared memory per block. Actually if I allocate a large char array in shared memory it slows down my program. For example

__shared__ char database[49000];

makes the program run slower than

__shared__ char database[4900];

The program accesses only the first 100 chars of database, so the extra space is unnecessary. I can't figure out why this is happening. Any help would be appreciated. Thanks.
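For context, a minimal sketch of the situation being described (the kernel name, launch configuration, and surrounding logic are assumptions, not from the original post):

__global__ void lookup(const char* input, char* output, int n)
{
    // Large static shared allocation, as in the question; only the
    // first 100 bytes are ever populated or read.
    __shared__ char database[49000];

    if (threadIdx.x < 100)
        database[threadIdx.x] = input[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        output[i] = database[i % 100];
}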

Answer

The reason for the relatively poor performance of CUDA shared memory when using larger arrays may have to do with the fact that each multiprocessor has a limited amount of available shared memory.

Each multiprocessor hosts several processors; for modern devices, typically 32, the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (latency is high due to pipelining).

CUDA schedules several blocks to a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable in GPUs. To effectively hide latency, you need enough extra warps to execute until the stalled warp can continue. If all warps stall on memory accesses, you can no longer hide the latency.
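For a rough sense of scale (the latency figure here is an assumed order of magnitude, not from the original answer): if a coalesced global load stalls a warp for roughly 400 cycles and the multiprocessor can issue one warp's instruction per cycle, then on the order of hundreds of ready instructions from other resident warps are needed to cover that stall. This is why reducing the number of resident warps can expose latency that was previously hidden.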

Shared memory is allocated to blocks in CUDA, and stored on a single multiprocessor on the GPU device. Each multiprocessor has a relatively small, fixed amount of shared memory space. CUDA cannot schedule more blocks to multiprocessors than the multiprocessors can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y shared memory, CUDA will schedule no more than floor(X/Y) blocks at a time to each multiprocessor (it might be less since there are other constraints, such as register usage).
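Applying this to the numbers in the question (the 48 KB figure assumes the Tesla M2050's Fermi-generation 48 KB shared / 16 KB L1 configuration): with X = 49152 bytes per multiprocessor, a block declaring __shared__ char database[49000] allows only floor(49152 / 49000) = 1 resident block per multiprocessor, while database[4900] allows up to floor(49152 / 4900) = 10, subject to the other limits mentioned above.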

Ergo, by increasing shared memory usage of a block, you might be reducing the number of active warps - the occupancy - of your kernel, thereby hurting performance. You should look into your kernel code by compiling with the -Xptxas="-v" flag; this should give you register and shared & constant memory usage for each kernel. Use this data and your kernel launch parameters, as well as other required information, in the most recent version of the CUDA Occupancy Calculator to determine whether you might be affected by occupancy.
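For example, compiling with

nvcc -Xptxas=-v kernel.cu

prints the per-kernel register, shared memory, and constant memory usage. On newer toolkits (CUDA 6.5 and later) you can also query the resulting occupancy limit at runtime; a minimal sketch, where mykernel is a placeholder name:

#include <cstdio>

__global__ void mykernel(const char* in, char* out) { /* ... */ }

int main()
{
    int numBlocks = 0;
    int blockSize = 256;
    // Last argument is dynamic shared memory per block; static
    // __shared__ arrays inside the kernel are accounted for automatically.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, mykernel,
                                                  blockSize, 0);
    printf("resident blocks per multiprocessor: %d\n", numBlocks);
    return 0;
}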

EDIT:

To address the other part of your question, assuming no shared memory bank conflicts and perfect coalescing of global memory accesses... there are two dimensions to this answer: latency and bandwidth. The latency of shared memory will be lower than that of global memory, since shared memory is on-chip. The bandwidth will be much the same. Ergo, if you are able to hide global memory access latency through coalescing, there is no penalty (note: the access pattern is important here, in that shared memory allows for potentially more diverse access patterns with little to no performance loss, so there can be benefits to using shared memory even if you can hide all the global memory latency).
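To illustrate that last point, a sketch (names and sizes are placeholders, not from the original question) of the kind of reuse pattern where staging through shared memory pays off even though the initial load from global memory is coalesced:

__global__ void lookup_shared(const char* table, const int* keys,
                              char* out, int n)
{
    __shared__ char cache[100];          // only the bytes actually used

    // Coalesced, one-time load from global into shared memory
    // (assumes blockDim.x >= 100).
    if (threadIdx.x < 100)
        cache[threadIdx.x] = table[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = cache[keys[i] % 100];   // irregular, reused reads hit shared memory
}

Note that sizing the shared array to the 100 bytes actually used, rather than 49000, also avoids the occupancy penalty described above.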
