GPU/CUDA: Re-ordering device memory
Question
I have a multi-dimensional array stored in device memory. I want to "permute" or "transpose" it, that is, re-arrange its elements according to a new order of dimensions.
For example, if I have a 2D array
A = [0, 1, 2
3, 4, 5]
I want to change the order of the dimensions so that I get
B = [0, 3
1, 4
2, 5]
This re-ordering effectively copies the elements, which are stored in memory in the order [0,1,2,3,4,5], and returns them in the new order [0,3,1,4,2,5].
I know how to map the indices from A to B. My question is: how can I execute this mapping efficiently on the device using CUDA?
Recommended answer
You can check this post: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/
Naive matrix transpose:
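// TILE_DIM and BLOCK_ROWS are compile-time constants (32 and 8 in the linked post).
// Assumes a square matrix whose width is a multiple of TILE_DIM.
__global__ void transposeNaive(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // Each thread copies TILE_DIM / BLOCK_ROWS elements.
    // Reads from idata are coalesced; writes to odata are strided by width.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[x*width + (y+j)] = idata[(y+j)*width + x];
}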
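The kernel above leaves its launch configuration implicit. As a minimal sketch, assuming the tile sizes from the linked post (TILE_DIM = 32, BLOCK_ROWS = 8) and a square n x n matrix, the grid and block shape could be derived as follows. `LaunchShape` and `transposeLaunchShape` are hypothetical names used only for illustration; the struct stands in for CUDA's `dim3` so the arithmetic can be checked on the host.

```cpp
#include <cassert>

// Tile sizes as used in the linked NVIDIA post (assumed here; the kernels
// only require them to be compile-time constants).
constexpr int TILE_DIM = 32;
constexpr int BLOCK_ROWS = 8;

// Hypothetical plain struct standing in for a pair of dim3 values.
struct LaunchShape {
    unsigned gridX, gridY;    // number of tiles along x and y
    unsigned blockX, blockY;  // threads per block along x and y
};

// Computes the launch shape for an n x n matrix.
// The kernels derive width from gridDim.x * TILE_DIM and use it as the row
// stride of both input and output, so n must be a multiple of TILE_DIM and
// the matrix must be square.
inline LaunchShape transposeLaunchShape(int n)
{
    static_assert(TILE_DIM % BLOCK_ROWS == 0,
                  "each thread must handle a whole number of rows");
    assert(n > 0 && n % TILE_DIM == 0);

    const unsigned tiles = static_cast<unsigned>(n / TILE_DIM);
    return LaunchShape{tiles, tiles,
                       static_cast<unsigned>(TILE_DIM),
                       static_cast<unsigned>(BLOCK_ROWS)};
}
```

In a CUDA source file the result would typically feed `dim3 grid(s.gridX, s.gridY)` and `dim3 block(s.blockX, s.blockY)` before a launch such as `transposeNaive<<<grid, block>>>(d_out, d_in)`, where `d_out` and `d_in` are assumed device pointers.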
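Coalesced transpose through shared memory:
__global__ void transposeCoalesced(float *odata, const float *idata)
{
    // Staging tile in shared memory. The linked post later pads this to
    // [TILE_DIM][TILE_DIM+1] to avoid shared-memory bank conflicts.
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // Coalesced read of one input tile into shared memory.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

    // All writes to the tile must finish before any thread reads it back.
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write: the transpose happens when indexing the shared tile.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}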
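The kernels above cover the square 2D case. Since the question asks about re-ordering the dimensions of a multi-dimensional array in general, the index mapping might be sketched as below. This is an illustration under stated assumptions, not something taken from the linked post: row-major layout, and the hypothetical names `HD`, `sourceIndex` and `permuteReference`. The `HD` macro only makes the mapping callable from both host and device when compiled with nvcc. A kernel that assigns one output index per thread and calls `sourceIndex` would be correct, but its reads would generally not be coalesced, which is exactly what the shared-memory tiling addresses in the 2D case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: for a linear index into the permuted (output) array,
// return the linear index of the element to read from the input array.
//   inDims[d]    extent of input dimension d
//   inStrides[d] row-major stride of input dimension d
//   perm[k]      input dimension that becomes output dimension k
HD inline std::size_t sourceIndex(std::size_t outIdx,
                                  const std::size_t* inDims,
                                  const std::size_t* inStrides,
                                  const int* perm, int rank)
{
    std::size_t src = 0;
    // Peel off output coordinates starting from the fastest-varying dimension.
    for (int k = rank - 1; k >= 0; --k) {
        const std::size_t extent = inDims[perm[k]];
        const std::size_t coord = outIdx % extent;
        outIdx /= extent;
        src += coord * inStrides[perm[k]];
    }
    return src;
}

// Host-side reference that applies the mapping to every element. It can be
// used to validate the output of a device implementation.
inline std::vector<float> permuteReference(const std::vector<float>& in,
                                           const std::vector<std::size_t>& inDims,
                                           const std::vector<int>& perm)
{
    assert(perm.size() == inDims.size());
    const int rank = static_cast<int>(inDims.size());

    // Row-major strides of the input array.
    std::vector<std::size_t> strides(inDims.size());
    std::size_t total = 1;
    for (int d = rank - 1; d >= 0; --d) {
        strides[d] = total;
        total *= inDims[d];
    }
    assert(in.size() == total);

    std::vector<float> out(total);
    for (std::size_t i = 0; i < total; ++i)
        out[i] = in[sourceIndex(i, inDims.data(), strides.data(),
                                perm.data(), rank)];
    return out;
}
```

For the array from the question, calling `permuteReference` with dimensions {2, 3} and permutation {1, 0} on [0,1,2,3,4,5] reproduces the ordering [0,3,1,4,2,5].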