When is padding for shared memory really required?

Problem Description

I am confused by two documents from NVIDIA. "CUDA Best Practices" describes that shared memory is organized in banks and that, in 32-bit mode, each 4 bytes generally form one bank (that is how I understood it). However, "Parallel Prefix Sum (Scan) with CUDA" (available here: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html) goes into detail about how padding should be added to the scan algorithm because of bank conflicts.

The problem for me is that the basic type for this algorithm, as presented, is float, and its size is 4 bytes. Thus each float is one bank and there is no bank conflict.

So is my understanding correct -- i.e. if you work on 4*N-byte types, you don't have to worry about bank conflicts because, by definition, there will be none? If not, how should I understand it (when to use padding)?

Recommended Answer

You might be interested in this webinar from the NVIDIA CUDA webinar page. Shared memory, including its banks, is also described on slides 35-45 of that webinar.

In general shared memory bank conflicts can occur any time two different threads are attempting to access (from the same kernel instruction) locations within shared memory for which the lower 4 (pre-cc2.0 devices) or 5 bits (cc2.0 and newer devices) of the address are the same. When a bank conflict does occur, the shared memory system serializes accesses to locations that are in the same bank, thus reducing performance. Padding attempts to avoid this for some access patterns. Note that for cc2.0 and newer, if all the bits are the same (i.e. same location) this does not cause a bank conflict.

It works like this:

__shared__ int A[2048];
int my;
my = A[0]; // A[0] is in bank 0
my = A[1]; // A[1] is in bank 1
my = A[2]; // A[2] is in bank 2
...
my = A[31]; // A[31] is in bank 31 (cc2.0 or newer device)
my = A[32]; // A[32] is in bank 0
my = A[33]; // A[33] is in bank 1

now, if we access shared memory across threads in a warp, we may hit bank conflicts:

my = A[threadIdx.x];    // no bank conflicts or serialization - handled in one trans.
my = A[threadIdx.x*2];  // 2-way bank conflicts - will cause 2 level serialization
my = A[threadIdx.x*32]; // 32-way bank conflicts - will cause 32 level serialization

Let's take a closer look at the 2-way bank conflict above. Since we are multiplying threadIdx.x by 2, thread 0 accesses location 0 in bank 0 but thread 16 accesses location 32 which is also in bank 0, thus creating a bank conflict. For the 32-way example above, all the addresses correspond to bank 0. Thus 32 transactions to shared memory must occur to satisfy this request, as they are all serialized.
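To make those serialization counts concrete, here is a minimal host-side sketch (an illustration added here, not part of the original answer) that, for a given stride, computes how many threads of a 32-thread warp land in the same bank, assuming 4-byte words and 32 banks (cc2.0 and newer):

#include <stdio.h>

int main(void)
{
    const int NUM_BANKS = 32;                    // cc2.0 and newer devices
    const int WARP_SIZE = 32;

    for (int stride = 1; stride <= 32; stride *= 2) {
        int hits[32] = {0};                      // threads hitting each bank
        for (int tid = 0; tid < WARP_SIZE; ++tid)
            hits[(tid * stride) % NUM_BANKS]++;  // bank of A[tid*stride]

        int worst = 0;                           // worst-case threads per bank
        for (int b = 0; b < NUM_BANKS; ++b)
            if (hits[b] > worst) worst = hits[b];

        printf("stride %2d -> %2d-way access\n", stride, worst);
    }
    return 0;
}

It reports a 1-way (conflict-free) access for stride 1, 2-way for stride 2, and 32-way for stride 32, matching the three accesses shown above.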

So to answer the question, if I knew that my access patterns would be like this for example:

my = A[threadIdx.x*32]; 

Then I might want to pad my data storage so that A[32] is a dummy/pad location, as are A[64], A[96], etc. Then I could fetch the same data like this:

my = A[threadIdx.x*33]; 

And get my data with no bank conflicts.
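As a minimal sketch of that padded layout (the kernel name, buffer arguments, and block size below are assumptions for illustration, not from the original answer), one pad word is inserted after every 32 real words so that the stride-33 access touches 32 different banks:

__global__ void strided_read(const int *in, int *out)
{
    __shared__ int A[32 * 33];   // 32 groups of 32 words, each followed by one pad word

    int t = threadIdx.x;         // assumes blockDim.x == 32 (one warp)
    A[t * 33] = in[t];           // word t*33 sits in bank (t*33) % 32 == t
    __syncthreads();
    out[t] = A[t * 33];          // the strided read is now spread across all 32 banks
}

The same idea is behind the common tile[32][33] (rather than tile[32][32]) declaration in shared-memory transpose kernels: the extra column shifts each row by one bank.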

Hope this helps.
