Non-Square Matrix Multiplication in CUDA


Question

The code I use for matrix multiplication in CUDA lets me multiply both square and non-square matrices. However, both the width and the height must be multiples of blocksize.

So, for example, I can multiply [3][6] * [6][3] (using blocksize = 3), but I can't multiply [3][2] * [2][3].

Does anyone know a way to do that? This is my kernel:

#include <stdio.h>
#include <limits.h>
#include <stdlib.h>

#define blocksize 3
#define HM (1*blocksize)
#define WM (2*blocksize)
#define WN (1*blocksize)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by*blocksize;
    int colN = tx + bx*blocksize;
    float Pvalue = 0;

    for (int m = 0; m < uWM/blocksize; ++m) {
        MS[ty][tx] = M[rowM*uWM + (m*blocksize + tx)];
        NS[ty][tx] = M[colN + uWN*(m*blocksize + ty)];
        __syncthreads();

        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k]*NS[k][tx];
        __syncthreads();
        P[rowM*WP + colN] = Pvalue;
    }
}

Thanks in advance!

Recommended answer

I think the easiest thing to do would be to pad the blocks on the end with zeros:

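// Round the tile count up so a partially filled last tile is still processed.
for (int m = 0; m < (uWM + blocksize - 1)/blocksize; ++m) {
    int colM = m*blocksize + tx;
    int rowN = m*blocksize + ty;
    // uHM (the height of M) is an assumed extra kernel parameter; the kernel
    // in the question only receives uWM and uWN. The height of N equals uWM.
    // Each tile is checked on its own: a thread whose output element lies
    // outside P may still load an element that its neighbours need.
    if (rowM >= uHM || colM >= uWM)
        MS[ty][tx] = 0.f;
    else
        MS[ty][tx] = M[rowM*uWM + colM];
    if (rowN >= uWM || colN >= uWN)
        NS[ty][tx] = 0.f;
    else
        NS[ty][tx] = N[colN + uWN*rowN];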

Give or take. (That NS line should reference N, not M, right?)
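For reference, the zero-padding idea could be fleshed out into a complete kernel roughly as follows. This is only a sketch under some assumptions: the kernel name nonsquarePadded, the host wrapper multiply, and the parameters hM, wM and wN (height of M, width of M, width of N) are made up for illustration and do not appear in the question. It also assumes row-major storage, passes the width of P as a parameter instead of using the WP macro, writes P once after the loop, and guards that write so threads outside P store nothing. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define blocksize 16  // tile edge; an arbitrary choice for this sketch

// Computes P (hM x wN) = M (hM x wM) * N (wM x wN), all stored row-major.
// hM, wM and wN are hypothetical parameter names, not from the question.
__global__ void nonsquarePadded(const float *M, const float *N, float *P,
                                int hM, int wM, int wN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y;
    int rowM = blockIdx.y * blocksize + ty;  // row of P handled by this thread
    int colN = blockIdx.x * blocksize + tx;  // column of P handled by this thread
    float Pvalue = 0.f;

    // Ceiling division, so a partially filled last tile is still processed.
    for (int m = 0; m < (wM + blocksize - 1) / blocksize; ++m) {
        int colM = m * blocksize + tx;
        int rowN = m * blocksize + ty;
        // Out-of-range elements are loaded as zeros and contribute nothing.
        MS[ty][tx] = (rowM < hM && colM < wM) ? M[rowM * wM + colM] : 0.f;
        NS[ty][tx] = (rowN < wM && colN < wN) ? N[rowN * wN + colN] : 0.f;
        __syncthreads();

        for (int k = 0; k < blocksize; ++k)
            Pvalue += MS[ty][k] * NS[k][tx];
        __syncthreads();
    }

    // Only threads that map to a real element of P store a result.
    if (rowM < hM && colN < wN)
        P[rowM * wN + colN] = Pvalue;
}

// Hypothetical host wrapper: copies the inputs over, launches, copies P back.
std::vector<float> multiply(const std::vector<float> &M,
                            const std::vector<float> &N,
                            int hM, int wM, int wN)
{
    std::vector<float> P(static_cast<size_t>(hM) * wN);
    float *dM, *dN, *dP;
    cudaMalloc((void **)&dM, M.size() * sizeof(float));
    cudaMalloc((void **)&dN, N.size() * sizeof(float));
    cudaMalloc((void **)&dP, P.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N.data(), N.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Round the grid up as well, so the edges of P are covered.
    dim3 block(blocksize, blocksize);
    dim3 grid((wN + blocksize - 1) / blocksize, (hM + blocksize - 1) / blocksize);
    nonsquarePadded<<<grid, block>>>(dM, dN, dP, hM, wM, wN);

    cudaMemcpy(P.data(), dP, P.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dN);
    cudaFree(dP);
    return P;
}
```

With this, the [3][2] * [2][3] case from the question goes through multiply(M, N, 3, 2, 3) regardless of the value of blocksize.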

But since I seem to be the only one here advocating existing tuned libraries when possible: why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.
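As a rough illustration of the library route, a single call to cublasSgemm handles arbitrary dimensions. The wrapper name multiplyCublas and the parameters hM, wM and wN are hypothetical, the matrices are assumed to be stored row-major as in the question, and error checking is omitted. Because cuBLAS expects column-major data, the sketch swaps the operands: a row-major matrix read as column-major is its transpose, and P^T = N^T * M^T.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Computes P (hM x wN) = M (hM x wM) * N (wM x wN) for row-major host data.
// multiplyCublas, hM, wM and wN are names invented for this sketch.
std::vector<float> multiplyCublas(const std::vector<float> &M,
                                  const std::vector<float> &N,
                                  int hM, int wM, int wN)
{
    std::vector<float> P(static_cast<size_t>(hM) * wN);
    float *dM, *dN, *dP;
    cudaMalloc((void **)&dM, M.size() * sizeof(float));
    cudaMalloc((void **)&dN, N.size() * sizeof(float));
    cudaMalloc((void **)&dP, P.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N.data(), N.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.f, beta = 0.f;
    // Column-major view: dN is a (wN x wM) matrix, dM is a (wM x hM) matrix,
    // and the (wN x hM) result is P^T, which is exactly row-major P.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                wN, hM, wM,
                &alpha,
                dN, wN,
                dM, wM,
                &beta,
                dP, wN);

    cudaMemcpy(P.data(), dP, P.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dM);
    cudaFree(dN);
    cudaFree(dP);
    return P;
}
```

Building this requires linking against cuBLAS, for example with nvcc and -lcublas. In a real program the handle would normally be created once and reused across calls.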
