CUDA coalesced access to global memory


Question

全局数组的索引从0到1024,假设我有4个块,每个块有256个线程。

I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has indexes from 0 to 1024, and let's say I have 4 blocks, each with 256 threads.

__shared__ int sData[256];

When is the access coalesced?

1.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];


Addresses in global memory are copied from 0 to 255, each by the 32 threads of a warp, so is it OK here?

2.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];


If someIndex is not a multiple of 32, is the access not coalesced? Misaligned addresses? Is that correct?

Answer


What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:

shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];


This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as an offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, then you lose coalescing.


The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.


On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.

Other examples:

Not coalesced:

shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];


Not coalesced, but not too bad on GT200 and later:

stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];

Completely uncoalesced:

stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];


Coalesced, 2D grid, 1D block:

int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch + 
                          blockIdx.x * blockDim.x + threadIdx.x]; 

Coalesced, 2D grid and block:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
