Numpy vs Cython speed

Problem description

I have an analysis code that does some heavy numerical operations using numpy. Just out of curiosity, I tried to compile it with cython with almost no changes, and then I rewrote the numpy part using loops.

To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple unrelated computation that shows similar behavior (although the timing difference is not as big):

Version 1 (without cython)

import numpy as np

def _process(array):

    rows = array.shape[0]
    cols = array.shape[1]

    out = np.zeros((rows, cols))

    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)

    return out

def main():
    data = np.load('data.npy')
    out = _process(data)
    np.save('vianumpy.npy', out)

Version 2 (building a module with cython)

import cython
cimport cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)

    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('viacynpy.npy', out)

Version 3 (building a module with cython, with the numpy call rewritten as explicit loops)

import cython
cimport cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(0, rows):
        for col in range(0, cols):
            for row2 in range(0, rows):
                out[row, col] += array[row2, col] - array[row, col]

    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('vialoop.npy', out)
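
(Versions 2 and 3 need to be compiled before they can be imported. Below is a minimal setup.py sketch, assuming the sources are saved as version2.pyx and version3.pyx to match the module names used in the timings that follow; build with python setup.py build_ext --inplace.)

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

# The numpy include directory is needed because both modules "cimport numpy".
extensions = [
    Extension("version2", ["version2.pyx"], include_dirs=[np.get_include()]),
    Extension("version3", ["version3.pyx"], include_dirs=[np.get_include()]),
]

setup(ext_modules=cythonize(extensions))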

With a 10000x10 matrix saved in data.npy, the times are:

$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop

$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop

$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop

Is this expected, or is there an optimization that I am missing? The fact that versions 1 and 2 give the same result is somewhat expected, but why is version 3 faster?

P.S. - This is NOT the calculation that I need to make, just a simple example that shows the same behavior.

Recommended answer

As mentioned in the other answers, version 2 is essentially the same as version 1 since cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this:

  • First, there is a certain amount of overhead in each call to a numpy function, compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.

  • Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
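
To illustrate the second point, here is a minimal sketch contrasting the two evaluation strategies (the function names are illustrative, not from the question's code; the array names follow the A/B/C example above):

import numpy as np

def vectorized(A, B, C):
    # NumPy first allocates a full temporary array for B * C,
    # then makes a second pass over memory to add A.
    return A + B * C

def fused_loop(A, B, C):
    # Element-by-element loop: each product is consumed immediately,
    # so no intermediate array is created. In pure Python this is slow,
    # but compiled with Cython (typed indices, boundscheck off) it
    # becomes a single pass over memory.
    rows, cols = A.shape[0], A.shape[1]
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = A[i, j] + B[i, j] * C[i, j]
    return out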

Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.

As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
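
For reference, a minimal numexpr sketch of the same kind of expression (the array shapes and names are illustrative): numexpr compiles the expression string and evaluates it in cache-sized blocks, so the full-size B*C temporary is never materialised.

import numpy as np
import numexpr as ne

A = np.random.rand(10000, 10)
B = np.random.rand(10000, 10)
C = np.random.rand(10000, 10)

# The expression is evaluated in small blocks that fit in cache,
# avoiding the full-size temporary that plain NumPy would create for B * C.
out = ne.evaluate("A + B * C")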
