How to generalize fast matrix multiplication on GPU using numba


Problem description

Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been reading up on it using the tutorial on their website, and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.

I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work).

from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)

I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.
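
For a small case like this, a quick host-side reference check against plain NumPy makes a wrong kernel result easy to spot (a minimal sketch, reusing the arrays from the snippet above):

import numpy as np

reference = x_h @ y_h                # NumPy matmul on the CPU
print(np.allclose(z_h, reference))   # prints False with TPB = 3, since the kernel result is wrong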

Answer

There are at least two problems with the posted code:

  1. This can't possibly be a correct range check:

if x >= C.shape[0] and y >= C.shape[1]:

In order for us to decide that a particular thread in the grid should not do any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.
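
For reference, the corrected test on its own would read as follows, although, as item 2 explains, an early return of this kind still cannot be used here:

if x >= C.shape[0] or y >= C.shape[1]:
    # Quit if (x, y) is outside of valid C boundary
    return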

  2. It is illegal to use cuda.syncthreads() in conditional code if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes not whole-number divisible by the threadblock size.

Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.

The following code has those items fixed. It seems to work correctly for your given test case:

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
          sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
          sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$

(Note that the use of and here in the bounds tests is correct. Testing whether a set of indices is in-bounds is different in a boolean sense from testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both indices to be in-bounds. In the out-of-bounds test, either index being out-of-bounds is disqualifying.)
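
In other words, the in-bounds test is simply the logical negation of the out-of-bounds test (De Morgan's law), which a quick host-side check illustrates (a sketch with made-up index values):

x, y = 5, 2          # hypothetical thread indices
rows, cols = 4, 4    # hypothetical C.shape
in_bounds  = (x < rows) and (y < cols)     # both must hold for the write to proceed
out_bounds = (x >= rows) or (y >= cols)    # either one alone disqualifies the thread
assert in_bounds == (not out_bounds)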

I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
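
One of those other problems, presumably, is the thread-count limit: a CUDA threadblock may contain at most 1024 threads, and this kernel launches TPB * TPB threads per block, so TPB = 33 already exceeds the limit. A small sketch of that arithmetic:

MAX_THREADS_PER_BLOCK = 1024            # CUDA limit on threads per block
for TPB in (32, 33):
    threads = TPB * TPB                 # square (TPB, TPB) block
    print(TPB, threads, threads <= MAX_THREADS_PER_BLOCK)
# 32 -> 1024 (allowed), 33 -> 1089 (the kernel launch would fail)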

As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
          sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
          sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp



#%%

x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$

I've also reordered the sense of x and y (and usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.
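
The reordering matters, presumably, because of memory coalescing: threads with consecutive threadIdx.x values should read consecutive addresses of the row-major global arrays. A small host-side sketch of the index patterns (modeling block (0, 0), the first tile, and ty = 0):

import numpy as np

TPB = 4
A = np.arange(16).reshape(4, 4)   # row-major, like the arrays in the example
i = 0                             # first tile
before = [A[tx, 0 + i * TPB] for tx in range(TPB)]   # original: tx walks down a column (strided reads)
after  = [A[0, tx + i * TPB] for tx in range(TPB)]   # reordered: tx walks along a row (contiguous reads)
print(before)   # [0, 4, 8, 12] -> stride of 4 elements between neighboring threads
print(after)    # [0, 1, 2, 3]  -> adjacent elements for neighboring threads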

Again, no claims of being defect-free. Furthermore, I'm sure "more optimal" code could be arrived at. However, optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
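
A minimal sketch of that cupy route, reusing the host arrays from the example above (the cast to float32 is an assumption on my part, not something cupy requires):

import cupy as cp

a = cp.asarray(x_h, dtype=cp.float32)   # host -> device copy
b = cp.asarray(y_h, dtype=cp.float32)
c = a @ b                               # matrix multiply runs on the GPU
print(cp.asnumpy(c))                    # device -> host copy of the result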

As discussed here, the OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit float variable makes an important performance difference.
