CUDA coalesced access to global memory

Question

I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has indexes from 0 to 1024, and let's say I have 4 blocks, each with 256 threads.

__shared__ int sData[256];

When is access coalesced?

1.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];

Addresses 0 to 255 in global memory are copied, 32 at a time by each warp in the block, so is this coalesced?

2.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];

If someIndex is not a multiple of 32, is it not coalesced? Misaligned addresses? Is that correct?

Answer

What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:

shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];

This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, then you lose coalescing.
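
To tie this back to the question's setup (1024 ints, 4 blocks of 256 threads), here is a minimal kernel sketch built around that indexing; the kernel name copyToShared and the bound n are illustrative choices, not from the original post.

__global__ void copyToShared(const int *gData, int n)
{
    __shared__ int sData[256];

    // stride of 1 between adjacent threads: each warp loads 128
    // consecutive, aligned bytes
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        sData[threadIdx.x] = gData[i];
    __syncthreads();  // sData is now ready for block-wide use

    // ... work with sData ...
}

// launched as, e.g.: copyToShared<<<4, 256>>>(gData, 1024);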

The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
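
As a concrete check of that rule against the question's 32-bit ints (the arithmetic below is mine, and assumes gData itself starts on a 128-byte boundary):

// 32 threads x 4 bytes = 128 bytes per warp of consecutive loads:
//   warp reads gData[0..31] -> bytes 0..127 -> one aligned segment -> 1 transaction
//   warp reads gData[1..32] -> bytes 4..131 -> straddles two segments -> 2 transactions
// So a someIndex that is not a multiple of 32 elements does not break the
// consecutive pattern; it just costs one extra transaction per warp.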

On GT2xx and earlier GPUs, it gets more complicated, but you can find the details in the programming guide.

Other examples:

Not coalesced:

// the index scales threadIdx.x by blockIdx.x, so the stride between
// adjacent threads is blockIdx.x rather than 1
shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];

Not coalesced, but not too bad on GT200 and later:

int stride = 2;
// every other element: a warp's 32 loads span 256 bytes, i.e. two
// aligned 128-byte segments, so two transactions per warp
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];

Not coalesced at all:

int stride = 32;
// adjacent threads' addresses are 128 bytes apart, so every thread in
// the warp hits its own 128-byte segment: 32 transactions per warp
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];

Coalesced, 2D grid, 1D block:

int elementPitch = blockDim.x * gridDim.x;  // elements per row of blocks
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch +
                          blockIdx.x * blockDim.x + threadIdx.x];

Coalesced, 2D grid and block:

int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
int elementPitch = blockDim.x * gridDim.x;      // elements per row
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
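
For completeness, a sketch of a host-side launch that matches this 2D indexing; the kernel name copy2D and the block shape are assumptions, and width/height are presumed to be exact multiples of the block dimensions (otherwise the kernel needs a bounds check):

dim3 block(32, 8);  // 256 threads; the x dimension spans a full warp
dim3 grid(width / block.x, height / block.y);
copy2D<<<grid, block>>>(gmem, width);
// inside copy2D, elementPitch == blockDim.x * gridDim.x == width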
