Access a matrix as its transpose in tiled matrix multiplication in CUDA


Problem Description

I'm currently experimenting with CUDA, and I came across this kernel in an answer about matrix multiplication: https://stackoverflow.com/a/18856054/7867026

Instead of computing A*B, I want to compute A_Transpose*A, but without storing A_Transpose (only matrix A is given as input to the kernel). I have to set the indexes properly, but I'm confused by this matrix representation. Any help would be appreciated.

Recommended Answer

In the first link it is identified that AxAT involves taking inner products of the rows of matrix A, and similarly ATxA involves taking inner products of the columns of matrix A. Also note the symmetry statement. In the second link (scroll down a bit from that point in the programming guide) you will find a complete tiled matrix multiply. You just need to index into both tiles by column.
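
Concretely, if A is m x n, entry (i,j) of ATxA is the inner product of columns i and j of A:

(ATxA)_ij = sum over k of A_ki * A_kj, for k = 0 .. m-1

Swapping i and j leaves the sum unchanged, which is exactly the symmetry mentioned above; each thread therefore walks down two columns of A instead of a row and a column.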

Here is a worked example, using the code from the SO answer you linked:

$ cat t1654.cu
#include <iostream>
#include <cstdio>
#include <cstdlib>

const int TILE_DIM = 32;
// computes C = A^T * A, where A is ARows x ACols and C is ACols x ACols
template <typename T>
__global__ void ATA(const T * __restrict__  A, T * __restrict__  C, int ARows, int ACols)
{
    T CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    __shared__ T As[TILE_DIM][TILE_DIM];
    __shared__ T Bs[TILE_DIM][TILE_DIM];

    for (int k = 0; k < (TILE_DIM + ARows - 1)/TILE_DIM; k++) {

         // As tile: walk down a column strip of A selected by blockIdx.y
         // (indexing A by column here is the key change vs. an ordinary A*B multiply)
         if (k*TILE_DIM + threadIdx.y < ARows && blockIdx.y*blockDim.y+threadIdx.x < ACols)
             As[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + blockIdx.y*blockDim.y+threadIdx.x];
         else
             As[threadIdx.y][threadIdx.x] = 0.0;

         // Bs tile: same load as the B tile of an ordinary tiled multiply, but reading from A
         if (k*TILE_DIM + threadIdx.y < ARows && Col < ACols)
             Bs[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + Col];
         else
             Bs[threadIdx.y][threadIdx.x] = 0.0;

         __syncthreads();

         // inner product of two columns of A: note As is read as As[n][threadIdx.y], i.e. by column
         for (int n = 0; n < TILE_DIM; ++n)
             CValue += As[n][threadIdx.y] * Bs[n][threadIdx.x];

         __syncthreads();
    }

    if (Row < ACols && Col < ACols)
        C[((blockIdx.y * blockDim.y + threadIdx.y)*ACols) +
           (blockIdx.x * blockDim.x)+ threadIdx.x] = CValue;
}

// naive reference kernels, used below only to verify the ATA result
template <typename T>
__global__ void transpose_naive(const T * __restrict__ in, T * __restrict__ out, const int dim){
  int col = threadIdx.x+blockDim.x*blockIdx.x;
  int row = threadIdx.y+blockDim.y*blockIdx.y;
  if ((col < dim) && (row < dim)) out[col*dim+row] = in[row*dim+col];
}

template <typename T>
__global__ void mm_naive(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int rowA, const int colA, const int colB){
  int col = threadIdx.x+blockDim.x*blockIdx.x;
  int row = threadIdx.y+blockDim.y*blockIdx.y;
  if ((row < rowA) && (col < colB)){
    T Cval = 0;
    for (int i = 0; i < colA; i++) Cval += A[row*colA+i]*B[i*colB+col];
    C[row*colB+col] = Cval;}
}


typedef float mt;
int main(){

  mt *d_A, *d_B, *d_C, *h_A, *h_C, *h_C1;
  int m = 64;
  int n = 64;
  h_A  = new mt[m*n];
  h_C  = new mt[n*n];
  h_C1 = new mt[n*n];
  cudaMalloc(&d_A, m*n*sizeof(d_A[0]));
  cudaMalloc(&d_B, m*n*sizeof(d_A[0]));
  cudaMalloc(&d_C, n*n*sizeof(d_C[0]));
  // test 1: A = identity, so A^T * A should reproduce the identity
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      h_A[i*n+j] = (i==j)?1.0f:0.0f;
  cudaMemcpy(d_A, h_A, m*n*sizeof(d_A[0]), cudaMemcpyHostToDevice);
  dim3 block(TILE_DIM, TILE_DIM);
  dim3 grid((n+block.x-1)/block.x, (n+block.y-1)/block.y);
  ATA<<<grid,block>>>(d_A, d_C, m, n);
  cudaMemcpy(h_C, d_C, n*n*sizeof(d_C[0]), cudaMemcpyDeviceToHost);
#ifdef DEBUG
  for (int i = 0; i < n; i++){
    for (int j = 0; j < n; j++)
      std::cout << h_C[i*n+j] << " ";
    std::cout << std::endl;}
  std::cout << std::endl;
#endif
  // test 2: random A, checked against an explicit transpose plus naive multiply
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      h_A[i*n+j] = rand()%10;
  cudaMemcpy(d_A, h_A, m*n*sizeof(d_A[0]), cudaMemcpyHostToDevice);
  ATA<<<grid,block>>>(d_A, d_C, m, n);
  cudaMemcpy(h_C, d_C, n*n*sizeof(d_C[0]), cudaMemcpyDeviceToHost);
#ifdef DEBUG
  for (int i = 0; i < n; i++){
    for (int j = 0; j < n; j++)
      std::cout << h_C[i*n+j] << " ";
    std::cout << std::endl;}
  std::cout << std::endl;
#endif
  // reference result: explicitly transpose A, then do a naive A^T * A multiply
  transpose_naive<<<grid,block>>>(d_A, d_B, n);
  mm_naive<<<grid,block>>>(d_B, d_A, d_C, n, n, n);
  cudaMemcpy(h_C1, d_C, n*n*sizeof(d_C[0]), cudaMemcpyDeviceToHost);
#ifdef DEBUG
  for (int i = 0; i < n; i++){
    for (int j = 0; j < n; j++)
      std::cout << h_C1[i*n+j] << " ";
    std::cout << std::endl;}
  std::cout << std::endl;
#endif
  for (int i = 0; i < n*n; i++) if (h_C[i] != h_C1[i]) {std::cout << "mismatch at: " << i << " was: " << h_C[i] << " should be: " << h_C1[i] << std::endl; return 0;}
}
$ nvcc -o t1654 t1654.cu
$ cuda-memcheck ./t1654
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
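
If you compile with -DDEBUG (e.g. nvcc -DDEBUG -o t1654 t1654.cu), the computed matrices are printed so you can inspect them. On recent CUDA toolkits, compute-sanitizer replaces the deprecated cuda-memcheck for the same check.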

Note that loading the Bs tile is identical in both cases. The main changes are in loading the As tile; also note the indexing change when computing CValue. These changes are necessary so that both tiles are indexed by column.
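
To make the contrast explicit, here is a side-by-side sketch (paraphrased, not quoted verbatim from the linked answer) of the row-wise pattern an ordinary tiled A*B multiply uses versus the column-wise pattern used in ATA above:

// ordinary tiled A*B (paraphrased): the As tile follows a row of A
As[threadIdx.y][threadIdx.x] = A[Row*ACols + k*TILE_DIM + threadIdx.x];
CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];   // row of A    . column of B

// ATA above: the As tile follows a column of A
As[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + blockIdx.y*blockDim.y + threadIdx.x];
CValue += As[n][threadIdx.y] * Bs[n][threadIdx.x];   // column of A . column of A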

There may still be bugs. I have not tested the non-square case, nor the case where the matrix size is not a multiple of the block size. Furthermore, I've taken no advantage of the symmetry in the output. However, this should help with the indexing.
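
As a follow-up on the symmetry point: since ATxA is symmetric, roughly half of the blocks can be skipped. Below is an untested sketch of this idea (my addition, not part of the original answer); ATA_sym is a hypothetical name, and the kernel assumes the same TILE_DIM and launch configuration as ATA above:

template <typename T>
__global__ void ATA_sym(const T * __restrict__ A, T * __restrict__ C, int ARows, int ACols)
{
    // skip blocks strictly above the diagonal; their values are mirrored from below
    if (blockIdx.x < blockIdx.y) return;

    T CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    __shared__ T As[TILE_DIM][TILE_DIM];
    __shared__ T Bs[TILE_DIM][TILE_DIM];

    for (int k = 0; k < (TILE_DIM + ARows - 1)/TILE_DIM; k++) {

         // same column-wise tile loads as in ATA above
         if (k*TILE_DIM + threadIdx.y < ARows && blockIdx.y*blockDim.y+threadIdx.x < ACols)
             As[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + blockIdx.y*blockDim.y+threadIdx.x];
         else
             As[threadIdx.y][threadIdx.x] = 0.0;

         if (k*TILE_DIM + threadIdx.y < ARows && Col < ACols)
             Bs[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + Col];
         else
             Bs[threadIdx.y][threadIdx.x] = 0.0;

         __syncthreads();

         for (int n = 0; n < TILE_DIM; ++n)
             CValue += As[n][threadIdx.y] * Bs[n][threadIdx.x];

         __syncthreads();
    }

    if (Row < ACols && Col < ACols) {
        C[Row*ACols + Col] = CValue;  // element (Row, Col)
        C[Col*ACols + Row] = CValue;  // its mirror (Col, Row), valid because ATxA is symmetric
    }
}

On diagonal blocks (blockIdx.x == blockIdx.y) both writes land on the same elements with the same value, which is harmless; a more careful version could skip those duplicate writes.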

