Non-square matrix multiplication in CUDA
Question
For my GPU programming class, we've been tasked with completing certain parts of a non-square matrix multiplication program: specifically, the kernel function and the initialization of the thread block and kernel grid dimensions.
I've based my code on the CUDA C Programming Guide's matrix multiplication code, but instead of using structs as they do, I have modified mine to use only the parameters given (since we're not allowed to change the parameters). We are provided with the three matrices A, B, and C, as well as their dimensions: m x k, k x n, and m x n, respectively. Where the struct code used A.height, I've used dimension m; where it used B.width, I've used dimension n; and so on.
I've run into several problems. The first is that my program doesn't pass the included test, which verifies the correctness of the product matrix C. I assume, then, that something is wrong in my matrix multiplication code, and that the issue probably arises from my adaptation of the struct code.
#include <stdio.h>
__global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
                        float* C) {
    /********************************************************************
     *
     * Compute C = A x B
     *   where A is a (m x k) matrix
     *   where B is a (k x n) matrix
     *   where C is a (m x n) matrix
     *
     ********************************************************************/
    // INSERT KERNEL CODE HERE
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < k; ++e) {
        Cvalue += (A[row * k + e]) * (B[e * n + col]);
    }
    C[row * n + col] = Cvalue;
}
My other problem, which I'm even less sure about, involves the code that initializes the thread block and kernel grid dimensions.
// Initialize thread block and kernel grid dimensions ---------------------
const unsigned int BLOCK_SIZE = 16; // Use 16x16 thread blocks
//INSERT CODE HERE
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(n / dimBlock.x, m / dimBlock.y);
// Invoke CUDA kernel -----------------------------------------------------
//INSERT CODE HERE
mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);
I understand dimBlock, but I don't understand dimGrid, and I don't have a clear idea of what to pass as its parameters. When I run the code as is, the kernel won't even launch if the matrix I pass in doesn't have a dimension that is a power of 2. And if I do use a power of 2, the test still fails.
I apologize if I've been too wordy. This is my first post, and I wanted to give as many details as possible. Hopefully someone can help walk me through these issues.
Recommended answer
The kernel below is a variant of the one I posted in "CUDA tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size" (http://stackoverflow.com/questions/18815489/cuda-tiled-matrix-matrix-multiplication-with-shared-memory-and-matrix-size-whic). It differs in that it does not use shared memory.
__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {
    float CValue = 0;
    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {
        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];
    }
    if (Row < CRows && Col < CCols) C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols)+(blockIdx.x*blockDim.x)+threadIdx.x]=CValue;
}
The two if statements in the kernel are the if statements mentioned in the answer by Eric.
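To see why the bounds checks matter, the launch can be emulated on the CPU. The sketch below is plain C, not CUDA: `emulate_guarded_matmul` is a hypothetical helper that loops over every (block, thread) pair of a rounded-up grid, computes the row and column the way the kernel does, and skips threads that fall outside C. `TILE` (set to 4 here) and all names are illustrative assumptions, not part of the original answer. It uses one early skip rather than the kernel's two separate if statements, which has the same effect on which elements are read and written.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* illustrative tile size; the answer uses TILE_DIM = 16 */

/* Hypothetical CPU emulation of a guarded one-thread-per-element kernel.
 * A is rows x inner, B is inner x cols, C is rows x cols, all row-major.
 * Returns how many emulated threads fell outside C and were skipped. */
static int emulate_guarded_matmul(const float *A, const float *B, float *C,
                                  int rows, int inner, int cols) {
    assert(rows > 0 && inner > 0 && cols > 0);
    int grid_x = (cols + TILE - 1) / TILE;  /* blocks across columns, rounded up */
    int grid_y = (rows + TILE - 1) / TILE;  /* blocks across rows, rounded up */
    int skipped = 0;
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx) {
                    int row = by * TILE + ty;  /* same formula as Row in the kernel */
                    int col = bx * TILE + tx;  /* same formula as Col in the kernel */
                    if (row >= rows || col >= cols) {
                        ++skipped;             /* thread in a partial edge tile */
                        continue;
                    }
                    float acc = 0.0f;
                    for (int e = 0; e < inner; ++e)
                        acc += A[(size_t)row * inner + e] * B[(size_t)e * cols + col];
                    C[(size_t)row * cols + col] = acc;
                }
    return skipped;
}
```

For a 5 x 6 result with 4 x 4 tiles, the grid is 2 x 2, so 64 threads are emulated and 34 of them are skipped. Without the guard, those 34 threads would read and write outside the matrices, which is what happens in the question's kernel whenever a dimension is not a multiple of the block size.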
For your convenience, I'm posting the full code below:
#include <stdio.h>
#include <stdlib.h> // malloc, free, rand
#include <math.h>
#define TILE_DIM 16 // Tile dimension
#define DIMX 373
#define DIMY 242
#define DIMZ 533
__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {
    float CValue = 0;
    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {
        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];
    }
    if (Row < CRows && Col < CCols) C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols)+(blockIdx.x*blockDim.x)+threadIdx.x]=CValue;
}
int main() {
    int CCols = DIMZ, CRows=DIMX, ACols=DIMY, ARows=DIMX, BCols=DIMZ, BRows=DIMY;
    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    dim3 dimGrid;
    // Round up so that partial tiles at the right and bottom edges get their own blocks
    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;
    float *deviceA, *deviceB, *deviceC;
    float* hostA = (float*)malloc(DIMX*DIMY*sizeof(float));
    float* hostB = (float*)malloc(DIMY*DIMZ*sizeof(float));
    float* hostC = (float*)malloc(DIMX*DIMZ*sizeof(float));
    float* hostCp = (float*)malloc(DIMX*DIMZ*sizeof(float)); // allocated but not used in this listing
    // A is DIMX x DIMY and B is DIMY x DIMZ, so each is filled over its own size
    for (int i = 0; i < DIMX*DIMY; i++) hostA[i] = rand()/(float)RAND_MAX;
    for (int i = 0; i < DIMY*DIMZ; i++) hostB[i] = rand()/(float)RAND_MAX;
    cudaMalloc((void **)&deviceA, DIMX*DIMY*sizeof(float));
    cudaMalloc((void **)&deviceB, DIMY*DIMZ*sizeof(float));
    cudaMalloc((void **)&deviceC, DIMX*DIMZ*sizeof(float));
    cudaMemcpy(deviceA, hostA, DIMX*DIMY*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, DIMY*DIMZ*sizeof(float), cudaMemcpyHostToDevice);
    MatMulNoShared<<<dimGrid , dimBlock>>>(deviceA , deviceB , deviceC , ARows , ACols, BRows ,BCols , CRows , CCols);
    cudaMemcpy(hostC, deviceC, DIMX*DIMZ*sizeof(float), cudaMemcpyDeviceToHost);
    // Release device and host memory
    cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);
    free(hostA); free(hostB); free(hostC); free(hostCp);
    return 0;
}
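The listing copies the result back to the host but does not check it. Since the question is about a failing correctness test, a CPU reference comparison may be useful. The following is a minimal sketch in plain C; `matmul_reference` and `max_abs_diff` are hypothetical helper names, and using the `hostCp` buffer for the reference result is an assumption about its intended purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Reference product on the CPU, row-major:
 * Cref (rows x cols) = A (rows x inner) * B (inner x cols). */
static void matmul_reference(const float *A, const float *B, float *Cref,
                             int rows, int inner, int cols) {
    assert(rows > 0 && inner > 0 && cols > 0);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (int e = 0; e < inner; ++e)
                acc += A[(size_t)r * inner + e] * B[(size_t)e * cols + c];
            Cref[(size_t)r * cols + c] = acc;
        }
}

/* Largest absolute element-wise difference between two arrays of length count. */
static float max_abs_diff(const float *x, const float *y, size_t count) {
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

In the listing, this could be called after the final cudaMemcpy as matmul_reference(hostA, hostB, hostCp, DIMX, DIMY, DIMZ), followed by comparing max_abs_diff(hostC, hostCp, DIMX*DIMZ) against a small tolerance. The tolerance is a judgment call (something like 1e-3 is an assumed starting point), because the GPU and CPU may round the float sums differently.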
Note that the two instructions
dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;
ensure full tiled coverage of the matrices, as mentioned in point 1 of Eric's answer.
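The difference between this rounding-up division and the truncating division in the question (n / dimBlock.x) can be checked with plain integer arithmetic. The sketch below is host-side C; `blocks_rounded_up` and `blocks_truncated` are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>

/* Blocks needed to cover `extent` elements with `block` threads per block,
 * rounding up so that a partial tile at the edge still gets a block. */
static unsigned int blocks_rounded_up(unsigned int extent, unsigned int block) {
    assert(block > 0);
    return (extent + block - 1) / block;
}

/* Truncating version, as in the question's dimGrid, shown for comparison. */
static unsigned int blocks_truncated(unsigned int extent, unsigned int block) {
    assert(block > 0);
    return extent / block;
}
```

With the sizes from the listing (CCols = 533, CRows = 373, 16 x 16 blocks), rounding up gives a 34 x 24 grid, while truncation gives 33 x 23 and leaves the last 5 columns and 5 rows of C unwritten. For a dimension smaller than 16, truncation yields 0 blocks, which is an invalid launch configuration; this is consistent with the kernel failing to launch for some sizes in the question.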