Why is matrix multiplication faster with numpy than with ctypes in Python?

Question

I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:


  • Pure Python implementation: no surprises here.
  • Numpy implementation using numpy.dot(a, b).
  • Interfacing with C using the ctypes module in Python.

This is the C code that is transformed into a shared library:

#include <stdio.h>
#include <stdlib.h>

void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float sub = 0.0f;  /* accumulate in float, not int */
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
    return;
}

And the Python code that calls it:

import ctypes
import numpy

def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1

    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    # pass the actual matrix dimension instead of a hard-coded size
    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), len(a))

    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1

    return res

I would have bet that the version using C would be faster... and I was wrong! Below is my benchmark, which seems to show that either I did something incorrectly, or that numpy is stupidly fast:

I'd like to understand why the numpy version is faster than the ctypes version; I'm not even talking about the pure Python implementation, since that one is kind of obvious.
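For reference, a minimal timing comparison between a pure-Python triple loop and numpy.dot can be sketched like this (the matrix size and function names here are my own choices, not the asker's code):

```python
import time
import numpy as np

def py_matmult(a, b):
    """Naive pure-Python triple loop over nested lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c_py = py_matmult(a.tolist(), b.tolist())
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
c_np = a.dot(b)
t_np = time.perf_counter() - t0

# Both produce the same result; numpy is typically orders of magnitude faster.
print(f"pure Python: {t_py:.4f}s, numpy.dot: {t_np:.6f}s")
```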

Answer

I'm not too familiar with Numpy, but the source is on Github. Part of the dot product is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I assume is translated into a specific C implementation for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 * LONG, ULONG, LONGLONG, ULONGLONG,
 * FLOAT, DOUBLE, LONGDOUBLE,
 * DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.
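If that guess is right, the calling code would behave roughly like the following sketch: one 1-D dot product per element of the result. The function name matmult_via_dot is hypothetical, an illustration of the guessed structure rather than numpy's actual source:

```python
import numpy as np

def matmult_via_dot(a, b):
    """One 1-D dot product (the FLOAT_dot-style inner loop) per result element."""
    n = a.shape[0]
    c = np.empty((n, n), dtype=a.dtype)
    for i in range(n):
        for j in range(n):
            # the inner loop of FLOAT_dot corresponds to this 1-D dot
            c[i, j] = np.dot(a[i, :], b[:, j])
    return c

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
c = matmult_via_dot(a, b)
```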

One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
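Numpy exposes these strides directly, which makes the idea concrete. The byte counts below assume a C-ordered float32 array:

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)

# strides = bytes to step to the next element along each axis:
# 16 bytes to the next row (4 floats of 4 bytes), 4 bytes to the next column.
print(a.strides)

row = a[1, :]   # walking a row: consecutive elements are 4 bytes apart
col = a[:, 1]   # walking a column: consecutive elements are 16 bytes apart
print(row.strides, col.strides)
```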

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and try to iterate over each contiguous part first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
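The row-versus-column trick can be illustrated by reordering the question's triple loop from i-j-k to i-k-j, so the inner loop walks both b and c along rows, which are contiguous in C order. This is a sketch of the access-pattern idea only; the payoff shows up in compiled C, not in pure Python:

```python
def matmult_ikj(a, b, n):
    """i-k-j loop order: the inner loop reads b[k] and writes c[i]
    left to right, so both are traversed along contiguous rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]   # hoisted: constant for the whole inner loop
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

result = matmult_ikj([[1.0, 2.0], [3.0, 4.0]],
                     [[5.0, 6.0], [7.0, 8.0]], 2)
```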

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.
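You can exercise that path from Python: for single-precision 1-D arrays, numpy's dot goes through sdot (or an optimized replacement) when numpy is built against a BLAS, though which library is used depends on the build:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
y = np.array([4.0, 5.0, 6.0], dtype=np.float32)

# 1*4 + 2*5 + 3*6 = 32; for float32 1-D inputs this is the sdot code path
# when numpy was built with a BLAS.
d = np.dot(x, y)
```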

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) +
            SX(i + 2) * SY(i + 2) + SX(i + 3) * SY(i + 3) +
            SX(i + 4) * SY(i + 4);
}

This unrolling factor is likely to have been picked after profiling several candidates. One theoretical advantage of it is that more arithmetic operations are done between each branch point, so the compiler and CPU have more choice about how to optimally schedule them to get as much instruction pipelining as possible.
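For illustration, the same remainder-then-unrolled control flow of sdot.c can be transcribed into Python (purely to show the structure; the pipelining benefit only materializes in compiled code):

```python
def sdot_unrolled(sx, sy):
    """Mirror sdot.c's structure: handle the n % 5 leftover elements
    first, then process five elements per loop iteration."""
    n = len(sx)
    m = n % 5
    stemp = 0.0
    for i in range(m):                    # clean-up loop for the remainder
        stemp += sx[i] * sy[i]
    for i in range(m, n, 5):              # unrolled main loop, 5 at a time
        stemp += (sx[i] * sy[i] + sx[i + 1] * sy[i + 1]
                  + sx[i + 2] * sy[i + 2] + sx[i + 3] * sy[i + 3]
                  + sx[i + 4] * sy[i + 4])
    return stemp
```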
