How to get faster code than numpy.dot for matrix multiplication?


Problem description

In Matrix multiplication using hdf5 I use hdf5 (PyTables) for big matrix multiplication, but I was surprised that using hdf5 it works even faster than plain numpy.dot with the matrices stored in RAM. What is the reason for this behavior?

And maybe there is some faster function for matrix multiplication in Python, because I still use numpy.dot for the small block matrix multiplications.

Here is some code:

Assume the matrices can fit in RAM: tested on a 10*1000 x 1000 matrix.

Using default NumPy (I think with no BLAS lib), with plain numpy arrays in RAM: time 9.48

If A, B are in RAM and C is on disk: time 1.48

If A, B, C are all on disk: time 372.25

If I use NumPy with MKL, the results are: 0.15, 0.45, 43.5.

The results look reasonable, but I still don't understand why in the first case the block multiplication is faster (when A and B are stored in RAM).

import time

import numpy as np
import tables

n_row = 1000
n_col = 1000
n_batch = 10

def test_plain_numpy():
    A = np.random.rand(n_row, n_col)  # np.random.rand returns float64
    B = np.random.rand(n_col, n_row)
    t0 = time.time()
    res = np.dot(A, B)
    print(time.time() - t0)

# A, B in RAM, C on disk
def test_hdf5_ram():
    # plain numpy arrays in RAM
    A = np.random.rand(n_row, n_col)
    B = np.random.rand(n_col, n_row)

    # settings for all hdf5 files
    atom = tables.Float32Atom()  # storing uint8 would use less memory
    filters = tables.Filters(complevel=9, complib='blosc')  # tune parameters
    Nchunk = 128  # chunk edge length; worth tuning
    chunkshape = (Nchunk, Nchunk)
    chunk_multiple = 1
    block_size = chunk_multiple * Nchunk

    # result array stored via hdf5
    fileName_C = 'CArray_C.h5'
    shape = (A.shape[0], B.shape[1])

    h5f_C = tables.open_file(fileName_C, 'w')
    C = h5f_C.create_carray(h5f_C.root, 'CArray', atom, shape,
                            chunkshape=chunkshape, filters=filters)

    sz = block_size

    t0 = time.time()
    # blocked multiply: the C[i, j] block accumulates A[i, k] @ B[k, j]
    for i in range(0, A.shape[0], sz):
        for j in range(0, B.shape[1], sz):
            for k in range(0, A.shape[1], sz):
                C[i:i+sz, j:j+sz] += np.dot(A[i:i+sz, k:k+sz], B[k:k+sz, j:j+sz])
    print(time.time() - t0)

    h5f_C.close()
def test_hdf5_disk():
    rows = n_row
    cols = n_col
    batches = n_batch

    # settings for all hdf5 files
    atom = tables.Float32Atom()  # storing uint8 would use less memory
    filters = tables.Filters(complevel=9, complib='blosc')  # tune parameters
    Nchunk = 128  # chunk edge length; worth tuning
    chunkshape = (Nchunk, Nchunk)
    chunk_multiple = 1
    block_size = chunk_multiple * Nchunk

    fileName_A = 'carray_A.h5'
    shape_A = (n_row * n_batch, n_col)  # predefined size

    h5f_A = tables.open_file(fileName_A, 'w')
    A = h5f_A.create_carray(h5f_A.root, 'CArray', atom, shape_A,
                            chunkshape=chunkshape, filters=filters)

    # fill A batch by batch so the whole array never lives in RAM
    for i in range(batches):
        data = np.random.rand(n_row, n_col)
        A[i*n_row:(i+1)*n_row] = data[:]

    rows = n_col
    cols = n_row

    fileName_B = 'carray_B.h5'
    shape_B = (rows, cols * batches)  # predefined size

    h5f_B = tables.open_file(fileName_B, 'w')
    B = h5f_B.create_carray(h5f_B.root, 'CArray', atom, shape_B,
                            chunkshape=chunkshape, filters=filters)

    sz = rows // batches  # integer division; rows/batches is a float in Python 3
    for i in range(batches):
        data = np.random.rand(sz, cols * batches)
        B[i*sz:(i+1)*sz] = data[:]

    fileName_C = 'CArray_C.h5'
    shape = (A.shape[0], B.shape[1])

    h5f_C = tables.open_file(fileName_C, 'w')
    C = h5f_C.create_carray(h5f_C.root, 'CArray', atom, shape,
                            chunkshape=chunkshape, filters=filters)

    sz = block_size

    t0 = time.time()
    for i in range(0, A.shape[0], sz):
        for j in range(0, B.shape[1], sz):
            for k in range(0, A.shape[1], sz):
                C[i:i+sz, j:j+sz] += np.dot(A[i:i+sz, k:k+sz], B[k:k+sz, j:j+sz])
    print(time.time() - t0)

    h5f_A.close()
    h5f_B.close()
    h5f_C.close()

Recommended answer

np.dot dispatches to BLAS when

  • NumPy has been compiled to use BLAS,
  • a BLAS implementation is available at run-time,
  • your data has one of the dtypes float32, float64, complex64 or complex128 (the types BLAS supports), and
  • the data is suitably aligned in memory.
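As a quick illustration (my own sketch, not part of the original answer; the helper name is_blas_friendly is made up), the dtype and layout conditions can be checked directly on an array:

import numpy as np

def is_blas_friendly(a):
    # dtype and memory layout that a BLAS gemm can consume directly
    blas_dtypes = (np.float32, np.float64, np.complex64, np.complex128)
    return a.dtype.type in blas_dtypes and (a.flags.c_contiguous or a.flags.f_contiguous)

X = np.random.rand(1000, 1000)          # float64, C-contiguous: BLAS fast path
print(is_blas_friendly(X))              # True
print(is_blas_friendly(X.astype(int)))  # False: integer dot never goes to BLAS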

Otherwise, it falls back to its own, slow matrix multiplication routine.

Checking your BLAS linkage is described here. In short, check whether there is a file _dotblas.so or similar in your NumPy installation. If there is, check which BLAS library it is linked against; the reference BLAS is slow, ATLAS is fast, and OpenBLAS and vendor-specific versions such as Intel MKL are even faster. Beware of multithreaded BLAS implementations, as they don't play nicely with Python's multiprocessing.
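On reasonably recent NumPy versions there is also np.show_config(), which prints the BLAS/LAPACK configuration NumPy was built with (the exact output format varies by version):

>>> import numpy as np
>>> np.show_config()  # look for an MKL, OpenBLAS or ATLAS section in the output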

Next, check your data alignment by inspecting the flags of your arrays. In versions of NumPy before 1.7.2, both arguments to np.dot should be C-ordered. In NumPy >= 1.7.2, this matters much less, as special cases for Fortran-ordered arrays have been introduced.

>>> X = np.random.randn(10, 4)
>>> Y = np.random.randn(7, 4).T
>>> X.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> Y.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
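On an older NumPy, if one of the operands is not C-ordered you can force a C-ordered copy before the dot; np.ascontiguousarray is standard NumPy, and the copy is skipped when the array is already C-contiguous:

>>> Y_c = np.ascontiguousarray(Y)  # C-ordered copy of the Fortran-ordered Y above
>>> Y_c.flags['C_CONTIGUOUS']
True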

If your NumPy is not linked against BLAS, either (easy) re-install it, or (hard) use the BLAS gemm (general matrix multiply) function from SciPy:

>>> from scipy.linalg import get_blas_funcs
>>> gemm = get_blas_funcs("gemm", [X, Y])
>>> np.all(gemm(1, X, Y) == np.dot(X, Y))
True

This looks easy, but it does hardly any error checking, so you must really know what you are doing.
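As a final sanity check, you can time np.dot directly to see whether your build is hitting BLAS; a rough sketch (absolute numbers will of course depend on your machine and BLAS):

>>> import timeit
>>> setup = "import numpy as np; X = np.random.randn(1000, 1000); Y = np.random.randn(1000, 1000)"
>>> timeit.timeit("np.dot(X, Y)", setup=setup, number=10)  # orders of magnitude faster with a good BLAS than without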
