Optimizing simple CPU-bound loops using Cython and replacing a list
Question
I am trying to evaluate some approaches, and I'm hitting a stumbling block with performance.
Why is my Cython code so slow? I expected the code to run quite a bit faster (maybe nanoseconds for a 2D loop with only 256 ** 2 entries) rather than milliseconds.
Here are my test results:
$ python setup.py build_ext --inplace; python test.py
running build_ext
counter: 0.00236220359802 sec
pycounter: 0.00323309898376 sec
percentage: 73.1 %
My initial code looks like this:
#!/usr/bin/env python
# encoding: utf-8
# filename: loop_testing.py

def generate_coords(dim, length):
    """Generates a list of coordinates from dimensions and size
    provided.

    Parameters:
        dim -- dimension
        length -- size of each dimension

    Returns:
        A list of coordinates based on dim and length
    """
    values = []
    if dim == 2:
        for x in xrange(length):
            for y in xrange(length):
                values.append((x, y))
    if dim == 3:
        for x in xrange(length):
            for y in xrange(length):
                for z in xrange(length):
                    values.append((x, y, z))
    return values
This works for what I need, but it is slow. For dim, length = (2, 256), I see a timing in IPython of approximately 2.3 ms.
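For comparison, the same coordinate list can be built in pure Python without explicit nested loops. This is a minimal sketch, assuming Python 3 and the standard-library itertools.product; generate_coords_product is a hypothetical name that does not appear in the original code, and no timing is claimed for it here.

```python
from itertools import product


def generate_coords_product(dim, length):
    """Return every coordinate of a dim-dimensional grid with `length` points per axis.

    Hypothetical helper for illustration; not part of the original module.
    """
    # product(range(length), repeat=dim) yields tuples in the same order as
    # the nested for-loops: the last axis varies fastest.
    return list(product(range(length), repeat=dim))


coords = generate_coords_product(2, 3)
print(coords[:4])  # first four (x, y) pairs
```

Unlike generate_coords, which returns an empty list for any dim other than 2 or 3, this variant accepts any dim.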
In an attempt to speed this up, I developed a Cython equivalent (at least, I think it is equivalent).
#!/usr/bin/env python
# encoding: utf-8
# filename: loop_testing.pyx
# cython: boundscheck=False
# cython: wraparound=False

cimport cython
from cython.parallel cimport prange

import numpy as np
cimport numpy as np

ctypedef int DTYPE

# 2D point updater
cpdef inline void _counter_2d(DTYPE[:, :] narr, int val) nogil:
    cdef:
        DTYPE count = 0
        DTYPE index = 0
        DTYPE x, y

    for x in range(val):
        for y in range(val):
            narr[index][0] = x
            narr[index][1] = y
            index += 1

cpdef DTYPE[:, :] counter(dim=2, val=256):
    narr = np.zeros((val**dim, dim), dtype=np.dtype('i4'))
    _counter_2d(narr, val)
    return narr

def pycounter(dim=2, val=256):
    vals = []
    for x in xrange(val):
        for y in xrange(val):
            vals.append((x, y))
    return vals
And the script that does the timing:
#!/usr/bin/env python
# filename: test.py
"""
Usage:
    test.py [options]
    test.py [options] <val>
    test.py [options] <dim> <val>

Options:
    -h --help This Message
    -n Number of loops [default: 10]
"""

if __name__ == "__main__":
    from docopt import docopt
    from timeit import Timer

    args = docopt(__doc__)
    dim = args.get("<dim>") or 2
    val = args.get("<val>") or 256
    n = args.get("-n") or 10
    dim = int(dim)
    val = int(val)
    n = int(n)

    tests = ['counter', 'pycounter']
    timing = {}
    for test in tests:
        code = "{}(dim=dim, val=val)".format(test)
        variables = "dim, val = ({}, {})".format(dim, val)
        setup = "from loop_testing import {}; {}".format(test, variables)
        t = Timer(code, setup=setup)
        timing[test] = t.timeit(n) / n

    for test, val in timing.iteritems():
        print "{:>20}: {} sec".format(test, val)
    print "{:>20}: {:>.3} %".format("percentage", timing['counter'] / timing['pycounter'] * 100)
And for reference, the setup.py used to build the Cython code:
from distutils.core import setup
from Cython.Build import cythonize
import numpy

include_path = [numpy.get_include()]

setup(
    name="looping",
    ext_modules=cythonize('loop_testing.pyx'),  # accepts a glob pattern
    include_dirs=include_path,
)
Link to working version: https://github.com/brianbruggeman/cython_experimentation
Recommended answer
This Cython code was slow because of the narr[index][0] = x assignment, which relies heavily on the Python C-API. Using narr[index, 0] = x instead is translated to pure C and solves this issue.
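Both spellings address the same element, so switching to the tuple index does not change the result. The sketch below checks this in plain NumPy. It is not compiled Cython, so it says nothing about the speed-up itself, and fill_chained and fill_direct are hypothetical names used only for this illustration.

```python
import numpy as np


def fill_chained(val):
    """Fill the array with narr[index][0] = x (chained indexing)."""
    narr = np.zeros((val ** 2, 2), dtype=np.intc)
    index = 0
    for x in range(val):
        for y in range(val):
            # narr[index] first produces a one-row view, which is then indexed.
            narr[index][0] = x
            narr[index][1] = y
            index += 1
    return narr


def fill_direct(val):
    """Fill the array with narr[index, 0] = x (a single tuple index)."""
    narr = np.zeros((val ** 2, 2), dtype=np.intc)
    index = 0
    for x in range(val):
        for y in range(val):
            # One indexing operation per element, no intermediate view.
            narr[index, 0] = x
            narr[index, 1] = y
            index += 1
    return narr


chained = fill_chained(4)
direct = fill_direct(4)
print(np.array_equal(chained, direct))
```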
As pointed out by @perimosocordiae, using cythonize with annotations is definitely the way to go to debug such issues.
In some cases it can also be worth explicitly specifying compilation flags for gcc in setup.py:
setup(
    [...]
    extra_compile_args=['-O2', '-march=native'],
    extra_link_args=['-O2', '-march=native'])
This should not be necessary, assuming reasonable default compilation flags. However, on my Linux system, for instance, the default appears to be no optimization at all, and adding the above flags results in a significant performance improvement.
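One way to attach those flags is per extension, since Extension accepts extra_compile_args and extra_link_args as parameters. The following is a sketch of a setup.py under stated assumptions: it uses setuptools rather than the distutils import from the question, it reuses the module and file names from the question, and it assumes gcc or a compatible compiler. It also passes annotate=True to cythonize, which makes Cython write an HTML report next to the generated C file showing which lines interact with Python.

```python
# setup.py (sketch; assumes setuptools, Cython and NumPy are installed)
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy

extensions = [
    Extension(
        "loop_testing",
        sources=["loop_testing.pyx"],
        include_dirs=[numpy.get_include()],
        # Flags taken from the answer; -march=native ties the binary to the
        # CPU of the build machine, so drop it for distributable builds.
        extra_compile_args=["-O2", "-march=native"],
        extra_link_args=["-O2", "-march=native"],
    )
]

setup(
    name="looping",
    # annotate=True additionally produces loop_testing.html for inspection.
    ext_modules=cythonize(extensions, annotate=True),
)
```

It is built the same way as in the question, with python setup.py build_ext --inplace.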