Minimizing overhead due to the large number of Numpy dot calls


Problem description

My problem is the following: I have an iterative algorithm such that at each iteration it needs to perform several matrix-matrix multiplications dot(A_i, B_i), for i = 1 ... k. Since these multiplications are performed with Numpy's dot, I know they are calling a BLAS-3 implementation, which is quite fast. The problem is that the number of calls is huge, and it turned out to be a bottleneck in my program. I would like to minimize the overhead due to all these calls by making fewer products with bigger matrices.

For simplicity, consider that all matrices are n x n (usually n is not big; it ranges between 1 and 1000). One way around my problem would be to consider the block-diagonal matrix diag(A_i) and perform the product below.
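
In other words, stack the B_i vertically and multiply once by the block-diagonal matrix built from the A_i:

    [ A_1           ]   [ B_1 ]   [ A_1 B_1 ]
    [      ...      ] . [ ... ] = [   ...   ]
    [           A_k ]   [ B_k ]   [ A_k B_k ]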

This is just one call to the function dot, but now the program wastes a lot of time performing multiplications with zeros. This idea doesn't seem to work, but it gives the result [A_1 B_1, ..., A_k B_k], that is, all the products stacked in a single big matrix.

My question is this: is there a way to compute [A_1 B_1, ..., A_k B_k] with a single function call? Or, even more to the point, how can I compute these products faster than by making a loop of Numpy dot calls?
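
To make the setup concrete, here is a minimal sketch of the loop being asked about (assuming the matrices are stacked into arrays of shape (k, n, n)):

import numpy as np

k, n = 1000, 2
A = np.random.rand(k, n, n)
B = np.random.rand(k, n, n)

# one separate BLAS call per pair -- this per-call overhead is the bottleneck
C = np.empty_like(A)
for i in range(k):
    C[i] = np.dot(A[i], B[i])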

Recommended answer

It depends on the size of the matrices.

Edit

For larger n x n matrices (above approx. size 20), a BLAS call from compiled code is faster; for smaller matrices, custom Numba or Cython kernels are usually faster.

The following method generates custom dot functions for given input shapes. With this method it is also possible to benefit from compiler-related optimizations like loop unrolling, which are especially important for small matrices.

It has to be noted that generating and compiling one kernel takes approx. 1 s, so make sure to call the generator only if you really have to.

Generator function

import numpy as np
import numba as nb

def gen_dot_nm(x,y,z):
    #small kernels
    @nb.njit(fastmath=True,parallel=True)
    def dot_numba(A,B):
        """
        calculate dot product for (x,y)x(y,z)
        """
        assert A.shape[0]==B.shape[0]
        assert A.shape[2]==B.shape[1]

        assert A.shape[1]==x
        assert B.shape[1]==y
        assert B.shape[2]==z

        res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            for i in range(x):
                for j in range(z):
                    acc=0.
                    for k in range(y):
                        acc+=A[ii,i,k]*B[ii,k,j]
                    res[ii,i,j]=acc
        return res

    #large kernels
    @nb.njit(fastmath=True,parallel=True)
    def dot_BLAS(A,B):
        assert A.shape[0]==B.shape[0]
        assert A.shape[2]==B.shape[1]

        res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            res[ii]=np.dot(A[ii],B[ii])
        return res

    #for square matrices above size 20
    #calling BLAS is faster
    if x>=20 or y>=20 or z>=20:
        return dot_BLAS
    else:
        return dot_numba
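
Since x, y and z are captured as compile-time constants inside the closure, Numba can specialize and unroll the inner loops for the given shape; this is where the speedup over a generic kernel for small matrices comes from.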

Example usage

A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)

dot22=gen_dot_nm(2,2,2)
X=dot22(A,B)
%timeit X3=dot22(A,B)
#5.94 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 

Old answer

Another alternative, though it would be more work, would be to use a special BLAS implementation that creates custom kernels for very small matrices just in time, and then call these kernels from C.

Example

import numpy as np
import numba as nb

#Don't use this for larger submatrices
@nb.njit(fastmath=True,parallel=True)
def dot(A,B):
    assert A.shape[0]==B.shape[0]
    assert A.shape[2]==B.shape[1]

    res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        for i in range(A.shape[1]):
            for j in range(B.shape[2]):
                acc=0.
                for k in range(B.shape[1]):
                    acc+=A[ii,i,k]*B[ii,k,j]
                res[ii,i,j]=acc
    return res

@nb.njit(fastmath=True,parallel=True)
def dot_22(A,B):
    assert A.shape[0]==B.shape[0]
    assert A.shape[1]==2
    assert A.shape[2]==2
    assert B.shape[1]==2
    assert B.shape[2]==2

    res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        res[ii,0,0]=A[ii,0,0]*B[ii,0,0]+A[ii,0,1]*B[ii,1,0]
        res[ii,0,1]=A[ii,0,0]*B[ii,0,1]+A[ii,0,1]*B[ii,1,1]
        res[ii,1,0]=A[ii,1,0]*B[ii,0,0]+A[ii,1,1]*B[ii,1,0]
        res[ii,1,1]=A[ii,1,0]*B[ii,0,1]+A[ii,1,1]*B[ii,1,1]
    return res
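
Writing the 2 x 2 product out by hand eliminates the inner loops entirely, so the intermediate values can stay in registers; the timings below show it is noticeably faster than the generic small-matrix kernel for this shape.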

Timings

A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)

X=A@B
X2=np.einsum("xik,xkj->xij",A,B)
X3=dot_22(A,B) #avoid measuring compilation overhead
X4=dot(A,B)    #avoid measuring compilation overhead

%timeit X=A@B
#262 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum("xik,xkj->xij",A,B,optimize=True)
#264 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit X3=dot_22(A,B)
#5.68 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit X4=dot(A,B)
#9.79 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

