Non-square matrix transpose with shared mem in CUDA


Problem Description


I was trying to get a variation of the SDK matrix transpose sample that works for all kinds of sizes. Briefly, I have to take an input array (double *a) and write it to two different parts (you will notice the different offsets) of a bigger matrix (double *tab). I'm storing the data in row-major format, so I'm using this macro for indexing:

#define IDX2L(i,j,ld) (((i)*(ld))+(j)) // 0-based index, row-major format
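As a quick sanity check of the macro (a throwaway snippet, not part of the original code):

// Element (i=2, j=3) of a row-major matrix with leading dimension ld=4
// lives at offset i*ld + j = 2*4 + 3 = 11.
int offset = IDX2L(2, 3, 4); // offset == 11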

This is the simple code I use.

__global__ void cuda_a_Coalesced(double *tab, int tab_rows, int a_rows, double *a)
{
    __shared__  double tile[16*(16+1)]; 
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;

    int col_2, row_2;
    int a_cols=tab_rows-a_rows; // tab_rows-a_rows is the number of columns of a
    int tab_cols=2*tab_rows+2;  // 2*tab_rows+2 is the number of columns of tab

    if( (col<a_cols) && (row<a_rows) ) 
    {
        // Load the data into shared mem
        tile[threadIdx.x+threadIdx.y*(16+1)]=a[IDX2L(row,col,a_cols)];

        // Normal copy (+ offsets)
        tab[IDX2L(row,col+tab_rows+a_rows,tab_cols)]= tile[threadIdx.x+threadIdx.y*(16+1)];

        // New idx
        col_2 = blockIdx.y * blockDim.y + threadIdx.x;
        row_2 = blockIdx.x * blockDim.x + threadIdx.y;
    }
    __syncthreads();

    if( (row_2<a_cols) && (col_2<a_rows) )
        // Transpose (+ other offsets)
        tab[IDX2L(row_2+a_rows,col_2+tab_rows,tab_cols)]= -tile[threadIdx.y+threadIdx.x*(16+1)];

}

The launch parameters are the following:

b1=(int)ceil((float)a_cols/16);
b2=(int)ceil((float)a_rows/16);
dim3 bck(b1,b2); dim3 th(16,16);

cuda_a_Coalesced<<<bck,th>>>(tab,tab_rows,a_rows,a);
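A hypothetical host-side driver consistent with these launch parameters might look like the sketch below (the function name and the memory management are assumptions; the allocation sizes follow from the kernel's index arithmetic, with tab taken to be tab_rows x (2*tab_rows+2)):

#include <math.h>
#include <cuda_runtime.h>

// Hypothetical driver: allocates the device buffers, runs the kernel once,
// and copies the result back. Error checking omitted for brevity.
void run_transpose(const double *h_a, double *h_tab, int tab_rows, int a_rows)
{
    int a_cols   = tab_rows - a_rows;
    int tab_cols = 2*tab_rows + 2;

    double *a, *tab;
    cudaMalloc((void**)&a,   sizeof(double) * a_rows   * a_cols);
    cudaMalloc((void**)&tab, sizeof(double) * tab_rows * tab_cols);
    cudaMemcpy(a,   h_a,   sizeof(double) * a_rows   * a_cols,   cudaMemcpyHostToDevice);
    cudaMemcpy(tab, h_tab, sizeof(double) * tab_rows * tab_cols, cudaMemcpyHostToDevice);

    // Round the grid up so partial tiles at the matrix edges are covered.
    int b1 = (int)ceil((float)a_cols/16);
    int b2 = (int)ceil((float)a_rows/16);
    dim3 bck(b1,b2); dim3 th(16,16);

    cuda_a_Coalesced<<<bck,th>>>(tab, tab_rows, a_rows, a);

    cudaMemcpy(h_tab, tab, sizeof(double) * tab_rows * tab_cols, cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(tab);
}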

The normal copy is always performed correctly, regardless of size. The transposed copy only works for sizes that are an integer multiple of the block size (as in the SDK sample). When the transposed copy fails, some parts of the operation are right and others are not, in a way that I cannot exactly predict or track. Note that the idea is to change the index in shared memory so that the transpose can be written to the output matrix in coalesced form (due to the row-major format).
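For reference, the coalesced-transpose pattern from the SDK sample that this kernel adapts looks roughly like the sketch below (the kernel name and the standalone rows/cols parameters are illustrative; the +1 padding puts successive tile rows in different shared-memory banks so the column-wise read is conflict-free):

#define TILE 16

__global__ void transpose_tile(double *out, const double *in, int rows, int cols)
{
    __shared__ double tile[TILE][TILE + 1]; // +1 avoids shared-memory bank conflicts

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: consecutive threads read consecutive input columns.
    if (row < rows && col < cols)
        tile[threadIdx.y][threadIdx.x] = in[row * cols + col];

    __syncthreads();

    // Swap the block coordinates so the write is coalesced too.
    int t_col = blockIdx.y * TILE + threadIdx.x;
    int t_row = blockIdx.x * TILE + threadIdx.y;

    if (t_row < cols && t_col < rows)
        out[t_row * rows + t_col] = tile[threadIdx.x][threadIdx.y];
}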

Could someone tell me why the code only works with those kinds of sizes?

Is there any trick to solve this situation?

Solution

The problem was due to undefined values: col_2 and row_2 were being assigned within an if() statement that not all threads reached, so threads outside that branch used uninitialized indices in the transposed copy.

To solve this, we can assign col_2 and row_2 their values when we declare these variables, and delete the same computation that took place within the aforementioned if():

__shared__  double tile[16*(16+1)];

int col = threadIdx.x + blockIdx.x * blockDim.x;
int row = threadIdx.y + blockIdx.y * blockDim.y;

int col_2 = blockIdx.y * blockDim.y + threadIdx.x;
int row_2 = blockIdx.x * blockDim.x + threadIdx.y;

int a_cols=tab_rows-a_rows; 
int tab_cols=2*tab_rows+2;

Thus, the rest of the code looks like this:

if( (col<a_cols) && (row<a_rows) ) 
{
    // Load the data into shared mem
    tile[threadIdx.x+threadIdx.y*(16+1)]=a[IDX2L(row,col,a_cols)];
    // Normal copy (+ offsets)
    tab[IDX2L(row,col+tab_rows+a_rows,tab_cols)]= tile[threadIdx.x+threadIdx.y*(16+1)];
}
__syncthreads();

if( (row_2<a_cols) && (col_2<a_rows) )
    // Transpose (+ other offsets)
    tab[IDX2L(row_2+a_rows,col_2+tab_rows,tab_cols)]= -tile[threadIdx.y+threadIdx.x*(16+1)];
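Putting the two fragments back together, the complete corrected kernel reads as follows (nothing new here; IDX2L is the macro defined in the question):

__global__ void cuda_a_Coalesced(double *tab, int tab_rows, int a_rows, double *a)
{
    __shared__  double tile[16*(16+1)];

    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;

    // Transposed indices, now computed unconditionally by every thread.
    int col_2 = blockIdx.y * blockDim.y + threadIdx.x;
    int row_2 = blockIdx.x * blockDim.x + threadIdx.y;

    int a_cols = tab_rows - a_rows; // number of columns of a
    int tab_cols = 2*tab_rows + 2;  // number of columns of tab

    if( (col<a_cols) && (row<a_rows) )
    {
        // Load the data into shared mem
        tile[threadIdx.x+threadIdx.y*(16+1)] = a[IDX2L(row,col,a_cols)];
        // Normal copy (+ offsets)
        tab[IDX2L(row,col+tab_rows+a_rows,tab_cols)] = tile[threadIdx.x+threadIdx.y*(16+1)];
    }
    __syncthreads();

    if( (row_2<a_cols) && (col_2<a_rows) )
        // Transpose (+ other offsets)
        tab[IDX2L(row_2+a_rows,col_2+tab_rows,tab_cols)] = -tile[threadIdx.y+threadIdx.x*(16+1)];
}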
