CUDA 5.0 Memory alignment and coalesced access


Problem Description

I have a 2D host array with 10 rows and 96 columns. I load this array into my CUDA device's global memory linearly, i.e. row1, row2, row3 ... row10.

The array is of type float. In my kernel, each thread accesses one float value from the device's global memory.

 The BLOCK_SIZE I use is 96
 The GRID_DIM I use is 10
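
For concreteness, here is a minimal sketch of that launch configuration and access pattern (the kernel body and names are hypothetical, not from the original question):

    // 10 rows x 96 columns of float, stored row-major in global memory.
    #define ROWS 10
    #define COLS 96

    // One block per row (gridDim.x == 10), one thread per column
    // (blockDim.x == 96); each thread reads exactly one float.
    __global__ void readOne(const float *d_in, float *d_out)
    {
        int idx = blockIdx.x * COLS + threadIdx.x;
        d_out[idx] = d_in[idx];
    }

    // Launched as: readOne<<<ROWS, COLS>>>(d_in, d_out);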



Now, from what I understood from the "CUDA C Programming Guide" about coalesced accesses, the pattern I am using is correct: the warp accesses consecutive memory locations. But there is a clause about 128-byte memory alignment, which I fail to understand.

Q1) 128-byte memory alignment: does it mean that each thread in a warp should access 4 bytes, starting from an address of 0x00 (for example) up to 0x80?

Q2) So in this scenario, will I be making uncoalesced accesses or not?

My understanding is: each thread should make one 4-byte memory access, within a range of addresses such as 0x00 to 0x80. If a thread from the warp accesses a location outside that range, it is an uncoalesced access.

Answer

Loads from global memory are usually done in chunks of 128 bytes, aligned on 128-byte boundaries. Coalesced memory access means that you keep all accesses from your warp within one such 128-byte chunk. (On older cards, memory had to be accessed in order of thread ID, but newer cards no longer have this requirement.)

If the 32 threads in your warp each read a float, you will read a total of 128 bytes from global memory. If the memory is aligned correctly, all reads will come from the same chunk. If the alignment is off, you will need two reads. If you do something like a[32*i], then each access will come from a different 128-byte chunk in global memory, which will be very slow.
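
As an illustrative sketch (not part of the original answer), the two patterns for a single 32-thread warp look like this:

    // Each thread reads one of 32 consecutive floats: all 128 bytes
    // fall in one aligned chunk, so the warp needs a single transaction.
    __global__ void coalesced(const float *a, float *out)
    {
        int i = threadIdx.x;       // 0..31 within the warp
        out[i] = a[i];
    }

    // Each thread jumps 32 floats (128 bytes) ahead of its neighbour:
    // every thread touches a different 128-byte chunk, so the warp
    // needs 32 separate transactions. (a must hold at least 32*31+1 floats.)
    __global__ void strided(const float *a, float *out)
    {
        int i = threadIdx.x;
        out[i] = a[32 * i];
    }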

It doesn't matter which chunk you access, as long as all threads in a warp access the same chunk.

If you have an array of 96 floats, then if each thread with index i in your warp accesses a[i], it will be a coalesced read. The same holds for a[i+32] and a[i+64].
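
Put together, a single warp can read the whole 96-float row in three coalesced transactions (again a hypothetical sketch):

    // One 32-thread warp covers all 96 floats of a row in three
    // coalesced reads: a[i], a[i+32] and a[i+64] each stay inside
    // a single 128-byte chunk.
    __global__ void readRow(const float *a, float *out)
    {
        int i = threadIdx.x;       // 0..31
        out[i]      = a[i];        // floats  0..31
        out[i + 32] = a[i + 32];   // floats 32..63
        out[i + 64] = a[i + 64];   // floats 64..95
    }

These are the x = 0, 1, 2 cases of the a[32*x+i] pattern discussed below.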

So, the answer to Q1 is that all threads need to stay within the same 128-byte chunk, aligned on a 128-byte boundary.

The answer to your Q2 is that if your arrays are aligned correctly, and your accesses are of the form a[32*x + i], with i the thread ID and x any integer that is the same for all threads in the warp, your accesses will be coalesced.

According to Section 5.3.2.1.1 of the programming guide, memory is always aligned on at least 256-byte boundaries, so arrays created with cudaMalloc are always aligned correctly.
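
If you want to see this for yourself, a quick host-side check (a sketch, assuming the standard CUDA runtime API) could look like:

    #include <cstdio>
    #include <cstdint>
    #include <cuda_runtime.h>

    int main()
    {
        float *d_a = nullptr;
        cudaMalloc((void **)&d_a, 10 * 96 * sizeof(float));
        // cudaMalloc returns pointers aligned to at least 256 bytes,
        // which is also a 128-byte boundary.
        std::printf("256-byte aligned: %s\n",
                    (reinterpret_cast<std::uintptr_t>(d_a) % 256 == 0)
                        ? "yes" : "no");
        cudaFree(d_a);
        return 0;
    }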
