Cython实现不比纯python快 [英] Cython implementation no faster than pure python
问题描述
为练习,我编写了一个XOR双向链接列表
For an exercise I've written a XOR doubly-linked list
%%cython
from cpython.object cimport PyObject
from cpython.ref cimport Py_XINCREF, Py_XDECREF
from libc.stdint cimport uintptr_t
cdef class Node:
cdef uintptr_t _prev_xor_next
cdef object val
def __init__(self, object val, uintptr_t prev_xor_next=0):
self._prev_xor_next=prev_xor_next
self.val=val
@property
def prev_xor_next(self):
return self._prev_xor_next
@prev_xor_next.setter
def prev_xor_next(self, uintptr_t p):
self._prev_xor_next=p
def __repr__(self):
return str(self.val)
cdef class CurrentNode(Node):
cdef uintptr_t _node, _prev_ptr
def __init__(self, uintptr_t node, uintptr_t prev_ptr=0):
self._node = node
self._prev_ptr= prev_ptr
@property
def val(self):
return self.node.val
@property
def node(self):
ret=<PyObject *> self._node
return <Node> ret
@property
def prev_ptr(self):
return self._prev_ptr
cdef CurrentNode forward(self):
if self.node.prev_xor_next!=self._prev_ptr:
return CurrentNode(self.node.prev_xor_next^self._prev_ptr, self._node)
cdef CurrentNode backward(self):
if self._prev_ptr:
pp=<PyObject*>self._prev_ptr
return CurrentNode(self._prev_ptr, self._node^(<Node> pp).prev_xor_next)
def __repr__(self):
return str(self.node)
cdef class XORList:
cdef PyObject* first
cdef PyObject* last
cdef int length
def __init__(self):
self.length=0
@property
def head(self):
return (<Node> self.first)
@property
def tail(self):
return (<Node> self.last)
cdef append(self, object val):
self.length+=1
#empty list
if not self.first:
t=Node(val)
tp=(<PyObject*> t)
self.first=tp
Py_XINCREF(tp)
self.last=tp
Py_XINCREF(tp)
#not empty
else:
new_node=Node(val, <uintptr_t> self.last)
new_ptr=<PyObject*> new_node
cur_last=<Node>self.last
cur_last.prev_xor_next=cur_last.prev_xor_next^(<uintptr_t> new_ptr)
Py_XINCREF(new_ptr)
self.last=new_ptr
Py_XINCREF(new_ptr)
cpdef reverse(self):
temp=self.last
self.last=self.first
self.first=temp
def __repr__(self):
return str(list(iter_XORList(self)))
def __len__(self):
return self.length
def iter_XORList(l):
head=<PyObject*>l.head
cur=CurrentNode(<uintptr_t> head)
while cur:
yield cur
cur=cur.forward()
import time
start=time.time()
cdef XORList l=XORList()
for i in range(100000):
l.append(i)
print('time xor ', time.time()-start)
start=time.time()
l1=[]
for i in range(100000):
l1.append(i)
print('time regular ', time.time()-start)
使用上面的内置列表,在cython链接列表上,我的性能持续下降约10倍.
using the builtin list above I consistently get ~10x worse performance on the cython linked list.
time xor 0.10768294334411621
time regular 0.010972023010253906
当我分析xorlist的循环时,我得到:
When I profile the loop for the xorlist I get:
700003 function calls in 1.184 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.184 1.184 <string>:1(<module>)
1 0.039 0.039 1.184 1.184 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:108(list_check)
100000 0.025 0.000 0.025 0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:11(__init__)
99999 0.019 0.000 0.019 0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:16(__get__)
99999 0.018 0.000 0.018 0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:19(__set__)
1 0.000 0.000 0.000 0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:60(__init__)
100000 0.937 0.000 0.999 0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:70(append)
100000 0.113 0.000 1.146 0.000 line_profiler.py:111(wrapper)
1 0.000 0.000 1.184 1.184 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000 0.018 0.000 0.018 0.000 {method 'disable_by_count' of '_line_profiler.LineProfiler' objects}
100000 0.015 0.000 0.015 0.000 {method 'enable_by_count' of '_line_profiler.LineProfiler' objects}
因此,忽略对append
的调用,似乎大部分时间都花在了特殊方法上.
So, ignoring the calls to append
it seems most of the time is spent in the special methods.
这使我想到了问题:
- 我该如何加快速度
- 我认为Cython中的扩展类型是通过结构实现的,是什么导致它们的初始化花费了如此长的时间
我还尝试了使用纯python对普通双向链表进行另一种自定义实现,并且它与cython xorlist的计时在我的机器上相差10%以内.
I also tried another custom implementation of an oridnary doubly-linked list in pure python and the timings of it and the cython xorlist are similar within 10% difference on my machine.
推荐答案
配置文件中的三个罪魁祸首是Node的__init__
(在这里是不可避免的),而__get__
和__set__
是prev_xor_next
财产.我的观点是,您不希望使用prev_xor_next
属性(或者如果您将其设置为只读的话),因为它使应该是在Python中可以访问的Cython内部.
The three culprits from your profiling look to be Node's __init__
(which is unavoidable here), and __get__
and __set__
for the prev_xor_next
property. My view is that you don't want the prev_xor_next
property (or if you do it should be read-only) since it makes what should be a Cython internal accessible in Python.
无论是否删除该属性,都在这里使用Cython,以便可以直接写入基础C属性_prev_xor_next
.您可能需要在append
的开头(以及可能在其他功能中)设置cdef Node cur_last
,以确保Cython知道cur_last
的类型-我认为它应该可以解决,但是如果您得到
Whether you delete the property or not, you are working in Cython here so you can write directly to the underlying C attribute _prev_xor_next
. You may need to set cdef Node cur_last
at the start of append
(and maybe in other functions) to ensure that Cython knows the type of cur_last
- I think it should be able to work it out but if you get AttributeErrors
at runtime then this is what you need to do.
此更改使我的速度提高了30%(即,它的速度仍然比常规列表慢,但这是一个明显的改进).
This change gives me a 30% speed increase (i.e. it's still slower than a regular list, but it's a noticeable improvement).
我将概述一个更剧烈的更改,我可能应该在您关于此问题的第一个问题上提出该更改.这确实是一个模糊的大纲,因此没有为使其工作而作任何努力...
I'll outline a more drastic change that I possibly should have suggested on your first question about this problem. This really is a vague outline so no effort has been made to get it to work...
-
Node
完全在您的XORList
类内部:不应在Python中使用它,并且XORList
中所有Nodes
的生存期都直接与该列表绑定.因此,应该在销毁自己的XORList
时破坏它们(或者如果列表缩小,等等),因此不必进行引用计数.因此Node
应该是C结构而不是Python对象:
Node
is entirely internal to yourXORList
class: it should not be used from Python and the lifetime of all theNodes
in anXORList
is tied directly to the list. Therefore they should be destructed on the destruction of their owningXORList
(or if the list shrinks, etc) and so do not need to be reference counted. ThereforeNode
should be a C struct rather than a Python object:
cdef struct Node:
uintptr_t prev_xor_next
PyObject* val
# with associated constructor- and destructor-like functions:
cdef Node* make_node(object val, uintptr_t prev_xor_next):
cdef Node* n = <Node*>malloc(sizeof(Node))
n.val = <PyObject*>val
Py_XINCREF(n.val)
n.prev_xor_next = prev_xor_next
return n
cdef void destroy_node(Node* n):
Py_XDECREF(n.val)
free(n)
XORList
需要一个__dealloc__
函数,该函数循环遍历在每个Node
上调用destroy_node
的列表(在您的版本中也需要一个__dealloc__
函数!)
XORList
needs a __dealloc__
function that loops through the list calling destroy_node
on each Node
(it needs a __dealloc__
function anyway in your version too!)
CurrentNode
需要保留Cython类,因为这是您的迭代器"界面.它显然不再可以从Node
继承.我将其更改为:
CurrentNode
needs to remain a Cython class, since this is your "iterator" interface. It can obviously no longer inherit from Node
. I'd change it to:
cdef class XORListIterator:
cdef Node* current_node
cdef XORList our_list
our_list
属性的作用是确保XORList
至少与CurrentNode
一样长-如果最终得到的XORList
迭代器不再存在, current_node
属性将无效. current_node
不是XORListIterator
的所有者,因此不需要析构函数.
the point of the attribute our_list
is to ensure that the XORList
is kept alive at least as long as the CurrentNode
- if you end up with an iterator for an XORList
that no longer exists that the current_node
attribute will be invalid. current_node
is not owned by XORListIterator
so no need for a destructor.
我认为这种方案的危险在于,确保对XORList
所做的任何更改都不会完全使任何现有的XORListIterators
失效,直至崩溃.我怀疑这也是您当前版本的问题.
The danger with this scheme I think is making sure that if any changes to the XORList
don't completely invalidate any existing XORListIterators
to the point where you get crashes. I suspect this would also be an issue with your current version.
我怀疑内置的list
仍然会保持竞争力,因为它是一种编写良好,高效的结构.请记住,list.append
通常是一个简单的Py_INCREF
,偶尔会进行数组重新分配和复制.您总是需要创建一个新的Python对象(Node
)以及一些相关的引用计数.
I suspect the built-in list
will still remain competitive, since it is a well-written, efficient structure. Remember that list.append
is usually a simple Py_INCREF
, with an occasional array reallocation and copy. Yours always involves creation of a new Python object (the Node
) as well as some associated reference counting.
我的替代方案避免了很多引用计数(在计算时间和您必须考虑它"时间方面),因此我希望它会更紧密.它保留了每个append
的内存分配较小的缺点,这对于链表结构是不可避免的.
My alternative scheme avoids a lot of reference counting (both in terms of computational time and "you having to think about it" time), so I'd expect it to be much closer. It retain the disadvantage of a small memory allocation each append
, which is unavoidable for a linked-list structure.
附录:解决有关"Cython类的便利性"的评论.在我看来,使用Cython类和struct的两个优点是:
Addendum: to address the comment about "the convenience of a Cython class". In my view the two advantages of using a Cython class vs a struct are:
- 您可以获得与结构相当接近的内容,但不必担心C指针,并且引用计数已得到处理.很明显,对于这个问题,您对指针做了奇怪的事情,并且必须显式地处理引用计数,因此我认为这不适用于您.
- 您可以从Python使用它-您不仅限于Cython.在这种情况下,我认为这完全是
XORList
的实现细节,不应向Python用户公开.
- You get something fairly close to a struct, but don't have to worry about C pointers and the reference counting is taken care of. It's pretty clear that for this problem you're doing odd things to pointers and having to handle reference counting explicitly, so I don't think this is applies to you.
- You can use it from Python - you aren't just restricted to Cython. In this case I think it's entirely an implementation detail of the
XORList
that shouldn't be exposed to Python users.
因此,我认为专门使用Cython类的主要原因不适用于您的问题. (当然,对于很多代码来说,优点确实适用!)
Therefore I think the main reasons to use Cython classes specifically don't apply to your problem. (For a lot of code the advantages do apply, of course!)
也许值得补充的是,构造Cython类可能是其较慢的功能之一-为了支持可能的继承,构造过程是间接的".您已经成功创建了一个基准,该基准几乎是所有构建的基准-我想这是一个稍微偏斜的基准,实际情况可能还不错.
It's probably also worth adding that constructing Cython classes is probably one of their slower features - to support possible inheritance the construction process is quite "indirect". You've managed to create a benchmark that turns out to be almost all constructing - I'd guess it's a slightly skewed benchmark and the real case might not be that bad.
这篇关于Cython实现不比纯python快的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!