使用 CUDA 进行动态矩阵乘法 [英] Dynamic matrix multiplication with CUDA

查看:21
本文介绍了使用 CUDA 进行动态矩阵乘法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在尝试编写的简单程序的想法是从用户那里获取输入,以查看要乘以多大的矩阵.

The idea of my simple program that I've been trying to write is to take input from the user to see how large of a matrix to multiply.

我希望将输入 x 乘以 x,目前我不希望将两个不同的大小相乘.

I am looking to take the input x by x, I am not currently looking to multiply two different sizes at the moment.

你们会建议我如何完成这项工作?

How would you guys suggest I go about accomplishing this?

很抱歉我的问题不够清楚,我想修改这个内核,以便它可以处理任何大小的矩阵(其中 x 和 y 是等价的,以保持简单).而不是 16 的倍数.

I'm sorry my question was not clear enough, I want to modify this kernel so that it can handle a matrix of any size(where the x and y are equivalents to keep it simple). Instead of multiples of 16.

我不确定您是否需要我当前的代码,但这里是内核代码:

I'm not sure if you would need my current code but here is the kernel code:

// CUDA Kernel
__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int aBegin = wA * block_size * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = block_size;

    int bBegin = block_size * bx;

    int bStep  = block_size * wB;
    float Csub=0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        extern __shared__ float As[];
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        smem[ty*block_size+tx] = A[a + wA * ty + tx];

        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];

        __syncthreads();

        for (int k = 0; k < block_size; ++k)
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;

        __syncthreads();
    }

    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

更新:我决定使用零填充.但是我得到的答案不正确.取矩阵 A 2x2,填充为 16x16:

Update: I decided to go with the zero padding. However I am getting incorrect answers. Take matrix A 2x2, padded to 16x16:

5.000 0.000 9.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

矩阵 B,2x2 填充到 16x16:

Matrix B, 2x2 padded to 16x16:

7.000 4.000 8.000 7.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

所以我得到的 C 的结果是正确的:

So the result for C I get is correct:

35.000 20.000 40.000 35.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000

但是,如果去掉零,矩阵应该是:答:

However if you strip away the zeros the matrix should be: A:

5.000 0.000
9.000 0.000

乙:

7.000 4.000
8.000 7.000

C 应该是:

35.000 20.000
63.000 36.000

但是两个矩阵C并不相同.

However the two matrix Cs are not the same.

推荐答案

这不是一个很明确的问题,所以这个答案是基于你之前在几个相当相似的问题中提出的问题的猜测.

This isn't a very clear question, so this answer is something of a guess based on what you have previous asked in several rather similar questions earlier.

p>

了解如何进行此类运算的一个很好的起点是回到起点,从第一原理开始考虑矩阵-矩阵乘法问题.您对计算两个矩阵的点积 C = AB 的代码感兴趣.您的限制是您使用的内核只能计算矩阵的乘积,这些矩阵是某些内部块大小的整数倍.那你能做什么呢?

A good starting point to understanding how to do this sort of operation is to go back to the beginning and think about the matrix-matrix multiplication problem from first principles. You are interested in code to calculate the dot product of two matrices, C = AB. The restriction you have is that the kernel you are using can only compute products of matrices which are round multiples of some internal block size. So what can you do?

查看问题的一种方法是想象 AB 矩阵是 块矩阵.矩阵乘法可以这样写:

One way to look at the problem is to imagine that A and B matrices were block matrices. The matrix multiply can be written like this:

然后得到的矩阵C可以由AB中八个子矩阵的乘积组合形成:

and the resulting matrix C can then by formed by combinations of the products of the eight submatrices in A and B:

这如何帮助解决问题可能不是很明显,但让我们考虑一个具体的例子:

It might not be immediately obvious how this helps solve the problem, but let's consider a concrete example:

  1. 您有一个 最佳 矩阵乘法内核,它使用 32 的内部块大小,并且仅当矩阵是该块大小的整数倍时才正确.
  2. 您有一对 1000x1000 的方阵要相乘.
  1. You have an optimal matrix multiplication kernel which uses an internal block size of 32, and is only correct when matrices are round multiples of that block size.
  2. You have a pair of 1000x1000 square matrices to multiply.

这些第一个事实意味着您的内核只能正确解决 1024x1024 乘积或 992x992 乘积,但不能正确解决您需要的 1000x1000 运算.

These first facts implies that your kernel can only correctly solve either a 1024x1024 product, or a 992x992 product, but not the 1000x1000 operation you need.

如果你决定使用 1024x1024 的产品,你可以使用块分解的思想来制定这样的问题:

If you decide to use a 1024x1024 product, you can use the block decomposition idea to formulate the problem like this:

其中 Onn 表示适当大小的零矩阵.现在你有一对 1024x1024 矩阵,它们的乘积将产生

where Onn denotes a suitably sized matrix of zeros. Now you have a pair of 1024x1024 matrices, and their product will result in

即.左边的上块是一个包含 AB 的 1000x1000 矩阵.这实际上是 零填充 以获得正确的结果.在此示例中,这意味着执行的计算量比所需的多 7%.是否重要可能是特定于应用程序的.

ie. the left hand, upper block is a 1000x1000 matrix containing AB. This is effectively zero padding to achieve the correct result. In this example, it means that about 7% more computation is performed than is required. Whether than is important or not is probably application specific.

第二种方法是使用基本内核来计算一个 992x992 的乘积,然后在块分解版本的计算中制定一个策略来处理其他七个乘积,如下所示:

The second approach would be to use the basic kernel to compute a 992x992 product, then work out a strategy to deal with the other seven products in the block decomposed version of the calculation, something like this:

A11B11 为 992x992 矩阵,Onn 和以前一样是零矩阵.乍一看,这看起来不是很有帮助,但值得记住的是,所有右侧矩阵的计算只包含计算矩阵乘积所需的总计算量的大约 1.2%.它们可以在 GPU 进行主要计算时在主机 CPU 上轻松完成,然后添加到 GPU 结果中以形成最终矩阵.因为 CUDA API 是异步的,所以大部分主机计算可以完全隐藏并且实际上是免费的.

with A11 and B11 being 992x992 matrices, and Onn are zero matrices as before. At first inspection this doesn't look very helpful, but it is worth remembering that all the calculations to make the right hand side matrix contain only about 1.2% of the total computations required to compute the matrix product. They could easily be done on the host CPU while the GPU is doing the main calculation, and then added to the GPU result to form the final matrix. Because the CUDA API is asynchronous, most of that host calculation can be completely hidden and is effectively free.

此答案包含两种策略,用于执行您所要求的操作而无需更改多行当前内核代码.显然还有第三种方法,即更彻底地修改内核本身,但这是您应该先尝试自己的方法,如果您的解决方案不起作用,请寻求帮助.

This answer contains two strategies for doing what it is you are asking for without changing more than single line of your current kernel code. There is obviously a third way, which is to more radically modify the kernel itself, but that is something you should try yourself first and then ask for help if your solution doesn't work.

这篇关于使用 CUDA 进行动态矩阵乘法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆