何时真正需要填充共享内存? [英] When is padding for shared memory really required?

查看:13
本文介绍了何时真正需要填充共享内存?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对来自 NVidia 的 2 个文档感到困惑.CUDA 最佳实践"描述了共享内存是按银行组织的,通常在 32 位模式下,每 4 个字节就是一个银行(这就是我的理解).然而 Parallel Prefix Sum (Scan) with CUDA 详细介绍了应该如何填充由于银行冲突,添加到扫描算法中.

I am confused by 2 documents from NVidia. "CUDA Best Practices" describes that shared memory is organized in banks, and in general in 32-bit mode each 4 bytes is a bank (that is how I understood it). However Parallel Prefix Sum (Scan) with CUDA goes into details how padding should be added to scan algorithm because of bank conflicts.

对我来说问题是,这个算法的基本类型是浮点数,它的大小是 4 个字节.因此每个浮点数都是一个银行,没有银行冲突.

The problem for me is, the basic type for this algorithm as presented is float and its size is 4 bytes. Thus each float is a bank and there is no bank conflict.

所以我的理解是否正确——也就是说,如果您使用 4*N 字节类型,您不必担心银行冲突,因为根据定义不会有冲突?如果不是,我应该如何理解(何时使用填充)?

So is my understanding correct -- i.e. if you work on 4*N-byte types you don't have to worry about bank conflicts because by definition there will be none? If no, how should I understand it (when to use padding)?

推荐答案

您可能对NVIDIA CUDA 网络研讨会页面中的 noreferrer">本次网络研讨会 共享内存,包括存储库本次网络研讨会的幻灯片 35-45 中也有描述.

You might be interested in this webinar from the NVIDIA CUDA webinar page Shared memory including banks are described also on slides 35-45 from this webinar.

一般来说,当两个不同的线程试图访问(从相同的内核指令)共享内存中的低 4 位(cc2.0 之前的设备)或 5 位(cc2.0 之前的设备)或 5 位(cc2.0 和更新的设备)的地址相同.当确实发生存储库冲突时,共享内存系统会串行访问同一存储库中的位置,从而降低性能.对于某些访问模式,填充试图避免这种情况.请注意,对于 cc2.0 及更高版本,如果所有位都相同(即相同位置),则不会导致存储库冲突.

In general shared memory bank conflicts can occur any time two different threads are attempting to access (from the same kernel instruction) locations within shared memory for which the lower 4 (pre-cc2.0 devices) or 5 bits (cc2.0 and newer devices) of the address are the same. When a bank conflict does occur, the shared memory system serializes accesses to locations that are in the same bank, thus reducing performance. Padding attempts to avoid this for some access patterns. Note that for cc2.0 and newer, if all the bits are the same (i.e. same location) this does not cause a bank conflict.

从图形上看,我们可以这样看:

Pictorially, we can look at it like this:

__shared__ int A[2048];
int my;
my = A[0]; // A[0] is in bank 0
my = A[1]; // A[1] is in bank 1
my = A[2]; // A[2] is in bank 2
...
my = A[31]; // A[31] is in bank 31 (cc2.0 or newer device)
my = A[32]; // A[32] is in bank 0
my = A[33]; // A[33] is in bank 1

现在,如果我们在 warp 中跨线程访问共享内存,我们可能会遇到银行冲突:

now, if we access shared memory across threads in a warp, we may hit bank conflicts:

my = A[threadIdx.x];    // no bank conflicts or serialization - handled in one trans.
my = A[threadIdx.x*2];  // 2-way bank conflicts - will cause 2 level serialization
my = A[threadIdx.x*32]; // 32-way bank conflicts - will cause 32 level serialization

让我们仔细看看上面的 2-way bank 冲突.由于我们将 threadIdx.x 乘以 2,线程 0 访问了 bank 0 中的位置 0,但线程 16 访问了位置 32,该位置 在 bank 0 中,从而产生了 bank 冲突.对于上面的 32 路示例,所有地址都对应于银行 0.因此,必须发生 32 次共享内存事务才能满足此请求,因为它们都是序列化的.

Let's take a closer look at the 2-way bank conflict above. Since we are multiplying threadIdx.x by 2, thread 0 accesses location 0 in bank 0 but thread 16 accesses location 32 which is also in bank 0, thus creating a bank conflict. For the 32-way example above, all the addresses correspond to bank 0. Thus 32 transactions to shared memory must occur to satisfy this request, as they are all serialized.

所以回答这个问题,如果我知道我的访问模式会是这样的,例如:

So to answer the question, if I knew that my access patterns would be like this for example:

my = A[threadIdx.x*32]; 

然后我可能想要填充我的数据存储,以便 A[32] 是一个虚拟/填充位置,就像 A[64], A[96] 等然后我可以像这样获取相同的数据:

Then I might want pad my data storage so that A[32] is a dummy/pad location, as is A[64], A[96] etc. Then I could fetch the same data like this:

my = A[threadIdx.x*33]; 

并在没有银行冲突的情况下获取我的数据.

And get my data with no bank conflicts.

希望这会有所帮助.

这篇关于何时真正需要填充共享内存?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆