What numpy optimizations does cython do?


Problem Description

I was surprised to discover that this:

# fast_ops_c.pyx
cimport cython
cimport numpy as np

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.nonecheck(False)
def c_iseq_f1(np.ndarray[np.double_t, ndim=1, cast=False] x, double val):
    # Test (x==val) except gives NaN where x is NaN
    cdef np.ndarray[np.double_t, ndim=1] result = np.empty_like(x)
    cdef size_t i = 0
    cdef double _x = 0
    for i in range(len(x)):
        _x = x[i]
        result[i] = (_x-_x) + (_x==val)
    return result

is an order of magnitude faster than this:

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
@cython.nonecheck(False)
def c_iseq_f2(np.ndarray[np.double_t, ndim=1, cast=False] x, double val):
    cdef np.ndarray[np.double_t, ndim=1] result = np.empty_like(x)
    cdef size_t i = 0
    cdef double _x = 0
    for _x in x:        # Iterate over elements
        result[i] = (_x-_x) + (_x==val)
        i += 1          # advance the output index by hand
    return result

(for large arrays). I'm using the following to test the performance:

# fast_ops.py
import numpy as np

try:
    import pyximport
    pyximport.install(setup_args={"include_dirs": np.get_include()}, reload_support=True)
except Exception:
    pass

from fast_ops_c import *
import math

NAN = float("nan")

import unittest
class FastOpsTest(unittest.TestCase):

    def test_eq_speed(self):
        from timeit import timeit
        a = np.random.random(500000)
        a[1] = 2.
        a[2] = NAN

        a2 = c_iseq_f(a, 2.)
        def f1(): c_iseq_f2(a, 2.)
        def f2(): c_iseq_f1(a, 2.)

        # warm up
        [f1() for x in range(20)]
        [f2() for x in range(20)]

        n=1000
        dur = timeit(f1, number=n)
        print dur, "DUR1 s/iter", dur/n

        dur = timeit(f2, number=n)
        print dur, "DUR2 s/iter", dur/n

        dur = timeit(f1, number=n)

        print dur, "DUR1 s/iter", dur/n
        assert dur/n <= 0.005

        dur = timeit(f2, number=n)
        print dur, "DUR2 s/iter", dur/n

        print a2[:10]
        assert a2[0] == 0.
        assert a2[1] == 1.
        assert math.isnan(a2[2])
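
As an aside, the (_x-_x) + (_x==val) expression works because NaN propagates through floating-point arithmetic: for a NaN input, _x-_x is already NaN and stays NaN after the addition, while for any other value it is 0.0 plus the 0/1 result of the comparison. A quick sketch of the expected behavior (values chosen for illustration):

import numpy as np
from fast_ops_c import c_iseq_f1

x = np.array([1.0, 2.0, float("nan")])
print(c_iseq_f1(x, 2.0))    # expected: [ 0.  1. nan]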

I'm guessing that for _x in x is interpreted as running the Python iterator over x, while for i in range(n): is compiled to a C for loop, with x[i] translated to C-style array indexing.

However, I'm partly guessing and trying to follow by example. In its working with numpy (and here) docs, Cython says little about which numpy operations are optimized and which are not. Is there a guide to what is optimized?
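
Short of an official guide, one way to see exactly what Cython optimizes is its annotation output: compiling with annotation produces an HTML report in which yellow-highlighted lines go through the Python C-API and white lines compile to plain C. A hedged sketch, using the file name from the example above:

# from the command line: writes fast_ops_c.html next to the source
#   cython -a fast_ops_c.pyx
# or programmatically:
from Cython.Build import cythonize
cythonize("fast_ops_c.pyx", annotate=True)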

Similarly, the following, which assumes contiguous array memory, is considerably faster than either of the above.

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def c_iseq_f(np.ndarray[np.double_t, ndim=1, cast=False, mode="c"] x not None, double val):
    cdef np.ndarray[np.double_t, ndim=1] result = np.empty_like(x)
    cdef size_t i = 0
    cdef double _x = 0

    cdef double* _xp = &x[0]
    cdef double* _resultp = &result[0]
    for i in range(len(x)):
        _x = _xp[i]
        _resultp[i] = (_x-_x) + (_x==val)
    return result

Recommended Answer

The reason for this surprise is that x[i] is more subtle than it looks. Let's take a look at the following cython function:

%%cython
def cy_sum(x):
    cdef double res = 0.0
    cdef int i
    for i in range(len(x)):
        res += x[i]
    return res

and evaluate its performance:

import numpy as np
a=np.random.random((2000,))
%timeit cy_sum(a)

>>> 1000 loops, best of 3: 542 µs per loop

This is pretty slow! If you look into the produced C code, you will see that x[i] uses the __getitem__ functionality: it takes a C double, creates a Python float object, registers it with the garbage collector, casts it back to a C double, and destroys the temporary Python float. That is a lot of overhead for a single double addition!
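
In Python terms, each x[i] in this untyped version behaves roughly like the following sketch (a simplification for illustration, not the literal generated code):

obj = x.__getitem__(i)   # the C double is boxed into a new Python float
res += float(obj)        # ...and immediately unboxed back to a C double
del obj                  # the temporary Python float is destroyed again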

Let's make it clear to cython that x is a typed memory view:

%%cython
def cy_sum_memview(double[::1] x):
    cdef double res = 0.0
    cdef int i
    for i in range(len(x)):
        res += x[i]
    return res

with much better performance:

%timeit cy_sum_memview(a)   
>>> 100000 loops, best of 3: 4.21 µs per loop

So what happened? Because cython knows that x is a typed memory view (I would rather use a typed memory view than a numpy array in the signature of a cython function), it no longer has to go through the Python functionality __getitem__ but can access the C double values directly, without creating an intermediate Python float.
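
One caveat about the double[::1] signature: it declares a C-contiguous memory view, so strided input is rejected. A hedged usage sketch:

import numpy as np
a = np.random.random(2000)

cy_sum_memview(a)        # fine: a is C-contiguous
cy_sum_memview(a[::2])   # raises ValueError, the slice is not contiguous
# declare the parameter as double[:] instead to accept strided arrays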

But back to numpy arrays. Numpy arrays can be interpreted by cython as typed memory views, and thus x[i] can be translated into direct, fast access to the underlying memory.

So what about the for-each loop?

%%cython
def cy_sum_memview_for(double[::1] x):
    cdef double res = 0.0
    cdef double x_
    for x_ in x:
        res += x_
    return res

%timeit cy_sum_memview_for(a)
>>> 1000 loops, best of 3: 736 µs per loop

It is slow again. So cython seems not to be clever enough to replace the for-each loop with direct, fast access, and once again falls back to Python functionality, with the resulting overhead.

I must confess I'm as surprised as you are, because at first sight there is no good reason why cython should not be able to use fast access in the for-each case. But this is how it is...

I'm not sure that this is the reason, but the situation is not that simple with two-dimensional arrays. Consider the following code:

import numpy as np
a=np.zeros((5,1), dtype=int)
for d in a:
    print(int(d)+1)
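# prints 1 five times: each d has shape (1,), so int(d) succeeds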

This code works because d is a length-1 array and thus can be converted to a Python scalar via int(d).

But

for d in a.T:
    print(int(d)+1)
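# raises TypeError: "only size-1 arrays can be converted to Python scalars"
# (exact message wording varies by numpy version)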

throws, because now d has length 5 and thus cannot be converted to a Python scalar.

Because we want this code to behave the same as pure Python when cythonized, and whether the conversion to int is valid can only be determined at runtime, Cython has to use a Python object for d first; only then can it access the content of the array.
