Non-square matrix multiplication in CUDA


Question


For my GPU programming class, we've been tasked with completing certain parts of a non-square matrix multiplication program. Specifically, the kernel function and initializing the thread block and kernel grid dimensions.


I've based my code on the CUDA C Programming Guide's matrix multiplication code, but instead of using structs as they do, I have modified mine to use only the parameters given (since we're not allowed to change parameters). We are provided with the 3 matrices A, B, and C, as well as the dimensions of them- m x k, k x n, and m x n, respectively. Where the struct used A.height, I've used dimension m, where it used B.width, I've used dimension n, etc.


I've run into several problems, the first of which is that my program doesn't pass the included test, which verifies the correctness of the product matrix C. I assume that there is something wrong in my matrix multiplication code, then, and that the issue probably arises from me adapting the struct code.

#include <stdio.h>
__global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
        float* C) {

    /********************************************************************
     *
     * Compute C = A x B
     *   where A is a (m x k) matrix
     *   where B is a (k x n) matrix
     *   where C is a (m x n) matrix
     *
     ********************************************************************/

    // INSERT KERNEL CODE HERE
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    for (int e = 0; e < k; ++e){
        Cvalue += (A[row * k + e]) * (B[e * n + col]);
    }
    C[row * n + col] = Cvalue;
}


My other problem, which I'm even less sure about, involves the code to initialize the thread block and kernel grid dimensions.

// Initialize thread block and kernel grid dimensions ---------------------
    const unsigned int BLOCK_SIZE = 16; // Use 16x16 thread blocks
//INSERT CODE HERE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(n / dimBlock.x, m / dimBlock.y);
// Invoke CUDA kernel -----------------------------------------------------
//INSERT CODE HERE
    mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);


I understand dimBlock, but I don't understand dimGrid, and don't have a proper idea of what to use as parameters for it. When I run the code as is, the kernel won't even launch if the matrix I pass in doesn't have a dimension that is a power of 2. And if I do use a power of 2, the test still fails.


I apologize if I've been too wordy. This is my first post and I wanted to give as many details as possible. Hopefully someone can help walk me through these issues.

Answer

The following kernel I'm posting below is a variant of the one I posted in

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size (http://stackoverflow.com/questions/18815489/cuda-tiled-matrix-matrix-multiplication-with-shared-memory-and-matrix-size-whic)

in that it does not use shared memory.

__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

        for (int n = 0; n < TILE_DIM; ++n) 
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];

    }

    if (Row < CRows && Col < CCols) C[Row*CCols + Col] = CValue;
}


The two if statements in the kernel are the if statements mentioned in the answer by Eric.


For the sake of your convenience, I'm posting the full code below:

#include <stdio.h>
#include <stdlib.h>                     // malloc, rand

#define TILE_DIM 16                     // Tile dimension
#define DIMX 373                            
#define DIMY 242
#define DIMZ 533

__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

        for (int n = 0; n < TILE_DIM; ++n) 
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];

    }

    if (Row < CRows && Col < CCols) C[Row*CCols + Col] = CValue;
}

int main() {

    int CCols = DIMZ, CRows=DIMX, ACols=DIMY, ARows=DIMX, BCols=DIMZ, BRows=DIMY;

    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    dim3 dimGrid;

    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;

    float *deviceA, *deviceB, *deviceC;

    float* hostA    = (float*)malloc(DIMX*DIMY*sizeof(float));
    float* hostB    = (float*)malloc(DIMY*DIMZ*sizeof(float));
    float* hostC    = (float*)malloc(DIMX*DIMZ*sizeof(float));
    float* hostCp   = (float*)malloc(DIMX*DIMZ*sizeof(float));

    // Initialize A (DIMX x DIMY) and B (DIMY x DIMZ) over their full extents
    for (int i = 0; i < DIMX*DIMY; i++) hostA[i] = rand()/(float)RAND_MAX;
    for (int i = 0; i < DIMY*DIMZ; i++) hostB[i] = rand()/(float)RAND_MAX;

    cudaMalloc((void **)&deviceA, DIMX*DIMY*sizeof(float));
    cudaMalloc((void **)&deviceB, DIMY*DIMZ*sizeof(float));
    cudaMalloc((void **)&deviceC, DIMX*DIMZ*sizeof(float));

    cudaMemcpy(deviceA, hostA, DIMX*DIMY*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, DIMY*DIMZ*sizeof(float), cudaMemcpyHostToDevice);

    MatMulNoShared<<<dimGrid , dimBlock>>>(deviceA , deviceB , deviceC , ARows , ACols, BRows ,BCols , CRows , CCols);

    cudaMemcpy(hostC, deviceC, DIMX*DIMZ*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);
    free(hostA); free(hostB); free(hostC); free(hostCp);

    return 0;
}

Note that the two instructions

    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;


ensure a full tiled coverage of the matrices, as mentioned at point 1. of Eric's answer.

