GPU/CUDA: Re-ordering device memory


Question

I have a multi-dimensional array stored in device memory. I want to "permute"/"transpose" it, that is, re-arrange its elements according to a new order of dimensions.

For example, if I have a 2D array

A = [0, 1, 2
     3, 4, 5]

I want to change the order of the dimensions so I get

B = [0, 3
     1, 4
     2, 5]

This re-ordering effectively copies the elements that are stored in memory in the order [0, 1, 2, 3, 4, 5] and returns them in the new ordering [0, 3, 1, 4, 2, 5].

I know how to map the indices from A to B; my question is how I can execute this mapping efficiently on the device using CUDA.
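
For reference, the mapping described above can be sketched on the host as follows. This is only an illustration of the index arithmetic for a row-major 2D array; the helper name permute2D and the hard-coded 2x3 sizes are hypothetical, not part of the question.

#include <cstdio>

// Element (r, c) of A becomes element (c, r) of B, so the linear index
// r * cols + c in A maps to c * rows + r in B (row-major storage).
void permute2D(const float *A, float *B, int rows, int cols)
{
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            B[c * rows + r] = A[r * cols + c];
}

int main()
{
    float A[6] = {0, 1, 2, 3, 4, 5};   // 2 x 3, stored as [0,1,2,3,4,5]
    float B[6];
    permute2D(A, B, 2, 3);
    for (int i = 0; i < 6; ++i)
        printf("%g ", B[i]);           // prints 0 3 1 4 2 5
    printf("\n");
    return 0;
}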

Answer

You can check out this post: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/
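
The kernels below are taken from that post. They assume a square matrix and two compile-time constants; the post processes a 32x32 tile with a 32x8 thread block, so something like the following needs to appear before the kernels:

// Tile size and thread-block row count assumed by both kernels
// (the linked post uses a 32x32 tile processed by 32x8 thread blocks).
const int TILE_DIM   = 32;
const int BLOCK_ROWS = 8;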

Naive matrix transpose:

__global__ void transposeNaive(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;   // column in the input matrix
  int y = blockIdx.y * TILE_DIM + threadIdx.y;   // row in the input matrix
  int width = gridDim.x * TILE_DIM;              // matrix width (square matrix assumed)

  // Each thread copies TILE_DIM / BLOCK_ROWS elements; the reads from idata
  // are coalesced, but the strided writes to odata are not.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}

Coalesced transpose via shared memory:

__global__ void transposeCoalesced(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;   // matrix width (square matrix assumed)

  // Coalesced read of a TILE_DIM x TILE_DIM tile into shared memory.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  // Coalesced write of the transposed tile; the transpose itself happens
  // by swapping the shared-memory indices.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
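
A minimal host-side driver for the coalesced kernel might look like the sketch below. It assumes the constants and kernel above, a square N x N matrix with N divisible by TILE_DIM, and omits error checking; the buffer names (h_idata, d_odata, ...) are illustrative.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const int N = 1024;                         // matrix side, divisible by TILE_DIM
    const size_t bytes = N * N * sizeof(float);

    float *h_idata = (float*)malloc(bytes);
    float *h_odata = (float*)malloc(bytes);
    for (int i = 0; i < N * N; ++i) h_idata[i] = (float)i;

    float *d_idata, *d_odata;
    cudaMalloc((void**)&d_idata, bytes);
    cudaMalloc((void**)&d_odata, bytes);
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    // One thread block per TILE_DIM x TILE_DIM tile; each block has
    // TILE_DIM x BLOCK_ROWS threads, so every thread handles
    // TILE_DIM / BLOCK_ROWS elements.
    dim3 block(TILE_DIM, BLOCK_ROWS);
    dim3 grid(N / TILE_DIM, N / TILE_DIM);
    transposeCoalesced<<<grid, block>>>(d_odata, d_idata);
    cudaDeviceSynchronize();

    cudaMemcpy(h_odata, d_odata, bytes, cudaMemcpyDeviceToHost);
    // Element (0, 1) of the transpose should equal element (1, 0) of the input.
    printf("h_odata[1] = %g (expected %g)\n", h_odata[1], h_idata[N]);

    cudaFree(d_idata); cudaFree(d_odata);
    free(h_idata); free(h_odata);
    return 0;
}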

