Matrix Transpose (with shared memory) with arbitrary size in CUDA C

Question

I can't figure out a way to transpose a non-square matrix using shared memory in CUDA C. (I am new to CUDA C and C.)

On the website:

https://devblogs.nvidia.com/efficiency-matrix-transpose-cuda-cc/

an efficient way is shown to transpose a matrix (Coalesced Transpose Via Shared Memory). But it only works for square matrices.

GitHub (same as on the blog).

On Stack Overflow there is a similar question, where TILE_DIM = 16 is set. But with that implementation every thread only copies one element of the matrix to the result matrix.
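For context, that one-element-per-thread pattern might look roughly like the following sketch (the kernel and parameter names are illustrative and not taken from the linked answer; the input is assumed to be m rows by n columns in row-major order):

#define TILE_DIM 16

__global__ void transposeNaiveShared(const double* in, double* out, int n, int m){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int x = blockIdx.x*TILE_DIM + threadIdx.x; // column in the input
    int y = blockIdx.y*TILE_DIM + threadIdx.y; // row in the input

    // Each thread loads exactly one element into the tile.
    if(x < n && y < m){
        tile[threadIdx.y][threadIdx.x] = in[y*n + x];
    }
    __syncthreads();

    // Swap the BLOCK indices so the write to the output is coalesced.
    x = blockIdx.y*TILE_DIM + threadIdx.x; // column in the output
    y = blockIdx.x*TILE_DIM + threadIdx.y; // row in the output
    if(x < m && y < n){
        out[y*m + x] = tile[threadIdx.x][threadIdx.y];
    }
}

Launched with a (ceil(n/16), ceil(m/16)) grid of 16x16 blocks, every thread moves exactly one element.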

Here is my current implementation:

__global__ void transpose(double* matIn, double* matTran, int n, int m){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x*TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y*TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7

    // Load matrix into tile
    // Every Thread loads in this case 4 elements into tile.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n  && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[n*(i_m+i) + i_n];
        } else {
            tile[threadIdx.y+i][threadIdx.x] = -1; 
        }
    }
    __syncthreads();

    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(tile[threadIdx.x][threadIdx.y+i] != -1){ // <- is there a better way?
            if(true){      // <- what should be checked here?
                matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
            } else {
                matTran[m*i_n + (i_m+i)] = tile[threadIdx.x][threadIdx.y+i];
            }
        }
    }
}

Here, four elements are copied from each thread into the tile, and four elements from the tile are copied back into the result matrix.

Here is the kernel configuration <<<a, b>>>:

where a: (ceil(n/TILE_DIM), ceil(n/TILE_DIM))  (-> the divisions are cast to double) and
      b: (TILE_DIM, BLOCK_ROWS) (-> (32, 8))
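For reference, the launch described above would look roughly like this on the host side (d_matIn and d_matTran are hypothetical device pointers; TILE_DIM = 32 and BLOCK_ROWS = 8 as above):

dim3 b(TILE_DIM, BLOCK_ROWS);                      // (32, 8) threads per block
dim3 a((int)ceil((double)n / TILE_DIM),            // grid as described in the question
       (int)ceil((double)n / TILE_DIM));
transpose<<<a, b>>>(d_matIn, d_matTran, n, m);

Note that for a non-square input the y dimension of the grid would presumably need to cover m rather than n, since the kernel checks (i_m+i) < m on the load.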

I am currently using the if(tile[threadIdx.x][threadIdx.y+i] != -1) statement to determine which thread should copy to the result matrix (there might be another way). As far as I understand, it behaves as follows: within a block, the thread with ThreadIdx (x, y) copies the data into the tile, and the thread with ThreadIdx (y, x) copies the data back into the result matrix.

I inserted another if statement to determine where to copy the data, as there are 2(?) possible destinations depending on the ThreadIdx. Currently true is inserted there, but I have tried many different things. The best I could come up with was if(threadIdx.x+1 < threadIdx.y+i), which successfully transposes a 3x2 matrix.

Can someone please explain what I am missing when writing back into the result matrix? Obviously only one destination is correct. Using

matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];

as mentioned on the blog should be correct, but I can't figure out why it is not working for non-square matrices.

Answer

I was overcomplicating the problem. Here, the indices are NOT swapped as I thought. Instead, they are recalculated from the Y and X coordinates of the thread/block. Here is the snippet:

i_n = blockIdx.y * TILE_DIM + threadIdx.x;  
i_m = blockIdx.x * TILE_DIM + threadIdx.y;

Here is the corrected code:

__global__ void transposeGPUcoalescing(double* matIn, int n, int m, double* matTran){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x * TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y * TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7

    // Load matrix into tile
    // Every Thread loads in this case 4 elements into tile.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n  && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[(i_m+i)*n + i_n];
        }
    }
    __syncthreads();

    i_n = blockIdx.y * TILE_DIM + threadIdx.x; 
    i_m = blockIdx.x * TILE_DIM + threadIdx.y;

    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < m  && (i_m+i) < n){
            matTran[(i_m+i)*m + i_n] = tile[threadIdx.x][threadIdx.y + i]; // <- multiply by m, since the matrix is non-square!
        }
    }
}
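For completeness, a host-side usage sketch might look like the following (this is not part of the original answer; the helper name transposeOnDevice, the omitted error checking, and the integer ceiling division for the grid size are assumptions, with TILE_DIM = 32 and BLOCK_ROWS = 8 defined as before):

// Transposes an m-by-n (rows-by-columns) row-major matrix h_in into h_out.
void transposeOnDevice(const double* h_in, double* h_out, int n, int m){
    double *d_in, *d_out;
    size_t bytes = (size_t)n * m * sizeof(double);

    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_DIM, BLOCK_ROWS);              // (32, 8)
    dim3 grid((n + TILE_DIM - 1) / TILE_DIM,       // covers the n columns
              (m + TILE_DIM - 1) / TILE_DIM);      // covers the m rows

    transposeGPUcoalescing<<<grid, block>>>(d_in, n, m, d_out);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}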

Thanks to this comment for noticing the error :)
