How is 2D Shared Memory arranged in CUDA


Problem Description


I've always worked with linear shared memory (load, store, access neighbours), but I made a simple test in 2D to study bank conflicts, and the results have confused me.

The following code reads data from a one-dimensional global memory array into shared memory and copies it back from shared memory to global memory.

__global__ void update(int* gIn, int* gOut, int w) {

    // shared memory space
    __shared__ int shData[16][16];
    // map from threadIdx/blockIdx to data position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global id into the one-dimensional array
    int gid = x + y * w;

    // load shared memory
    shData[threadIdx.x][threadIdx.y] = gIn[gid];
    // synchronize threads (not strictly needed here, but kept for clarity)
    __syncthreads();
    // write data back to global memory
    gOut[gid] = shData[threadIdx.x][threadIdx.y];
}

The visual profiler reported conflicts in shared memory. The following code avoids those conflicts (only the differences are shown):

// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];

// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];

This behavior has confused me, because in Programming Massively Parallel Processors: A Hands-on Approach we can read:

matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
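
As a concrete illustration of that row-major convention, here is a minimal sketch (plain C; the 16-wide size just mirrors the tile used above) showing that arr[row][col] sits at linear offset row * WIDTH + col, so the elements of row 0 come first, then row 1, and so on:

#include <stdio.h>

#define WIDTH 16

int main(void) {
    int arr[WIDTH][WIDTH];
    // Fill each element with its own row-major linear offset.
    for (int r = 0; r < WIDTH; ++r)
        for (int c = 0; c < WIDTH; ++c)
            arr[r][c] = r * WIDTH + c;

    // Viewing the same storage as a flat array shows the row-major layout:
    // flat[16] is the first element of row 1.
    int *flat = &arr[0][0];
    printf("flat[16] = %d\n", flat[16]);   // prints 16
    return 0;
}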

Is this related to the shared memory arrangement, or to the thread indexes? Am I missing something?

The kernel configuration is as follows:

// kernel configuration
dim3 dimBlock  = dim3 ( 16, 16, 1 );
dim3 dimGrid   = dim3 ( 64, 64 );
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
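
For reference, a minimal host-side sketch of how this launch could be set up (the 1024 x 1024 buffer size follows from the grid configuration; the d_input/d_output names match the call above, and error checking is omitted):

const int w = 1024;                       // 64 blocks * 16 threads per side
const size_t bytes = (size_t)w * w * sizeof(int);

int *d_input, *d_output;
cudaMalloc(&d_input, bytes);
cudaMalloc(&d_output, bytes);
// ... copy input data into d_input with cudaMemcpy ...

dim3 dimBlock(16, 16, 1);
dim3 dimGrid(64, 64);
update<<<dimGrid, dimBlock>>>(d_input, d_output, w);
cudaDeviceSynchronize();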

Thanks in advance.

Solution

Yes, shared memory is arranged in row-major order as you expected. So your [16][16] array is stored row-wise, something like this:

       bank0 .... bank15
row 0  [ 0   .... 15  ]
    1  [ 16  .... 31  ]
    2  [ 32  .... 47  ]
    3  [ 48  .... 63  ]
    4  [ 64  .... 79  ]
    5  [ 80  .... 95  ]
    6  [ 96  .... 111 ]
    7  [ 112 .... 127 ]
    8  [ 128 .... 143 ]
    9  [ 144 .... 159 ]
    10 [ 160 .... 175 ]
    11 [ 176 .... 191 ]
    12 [ 192 .... 207 ]
    13 [ 208 .... 223 ]
    14 [ 224 .... 239 ]
    15 [ 240 .... 255 ]
       col 0 .... col 15

Because there are sixteen 32-bit shared memory banks on pre-Fermi hardware, every integer entry in each column maps onto one shared memory bank. So how does that interact with your choice of indexing scheme?
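
To put the bank rule in numbers before answering that: a 4-byte word at linear word offset i in shared memory lives in bank i % 16 on this hardware, and because the array is exactly 16 words wide, the bank of shData[row][col] collapses to just col (a sketch of the arithmetic, not a CUDA API):

// Bank of shData[row][col] for a [16][16] int array with 16 banks:
//   linear word offset = row * 16 + col
//   bank               = (row * 16 + col) % 16 = col
int bank_of(int row, int col) {
    return (row * 16 + col) % 16;   // always equals col here
}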

The thing to keep in mind is that threads within a block are numbered in the equivalent of column-major order (technically, the x dimension of the structure is the fastest varying, followed by y, followed by z). So when you use this indexing scheme:

shData[threadIdx.x][threadIdx.y]

threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:

shData[threadIdx.y][threadIdx.x]

threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, and no conflicts occur.
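
To see both cases concretely, the small host-side sketch below applies that i % 16 bank rule to one half-warp (threadIdx.x = 0..15 with threadIdx.y fixed at an arbitrary value): the [tx][ty] scheme hits one bank 16 times, while [ty][tx] hits each of the 16 banks exactly once:

#include <stdio.h>

int main(void) {
    const int ty = 3;   // any fixed threadIdx.y within the half-warp
    for (int tx = 0; tx < 16; ++tx) {
        int conflicted   = (tx * 16 + ty) % 16;  // shData[tx][ty]: always bank ty
        int conflictFree = (ty * 16 + tx) % 16;  // shData[ty][tx]: bank tx
        printf("tx=%2d  [tx][ty] -> bank %2d   [ty][tx] -> bank %2d\n",
               tx, conflicted, conflictFree);
    }
    return 0;
}

A common alternative fix, when column-wise access cannot be avoided, is to pad the array to [16][17]; the 17-word row stride then spreads the elements of a column across all 16 banks.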
