Python performance characteristics


Question


I'm in the process of tuning a pet project of mine to improve its performance. I've already busted out the profiler to identify hotspots, but I'm thinking understanding Python's performance characteristics a little better would be quite useful going forward.

There are a few things I'd like to know:

How smart is its optimizer?

Some modern compilers have been blessed with remarkably clever optimisers, that can often take simple code and make it run faster than any human attempts at tuning the code. Depending on how smart the optimizer is, it may be far better for my code to be 'dumb'.

While Python is an 'interpreted' language, it does appear to compile down to some form of bytecode (.pyc). How smart is it when it does this?

  • Will it fold constants?
  • Will it inline small functions or unroll short loops?
  • Will it perform complex data/flow analysis I'm not qualified to explain properly?

How fast are the following operations (comparatively)?

  • Function calls
  • Class instantiation
  • Arithmetic
  • 'Heavier' math operations such as sqrt()

How are numbers handled internally?

How are numbers stored within Python? Are they stored as integers/floats internally, or moved around as strings?
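For what it's worth, this part has a definite answer in CPython: numbers are genuine numeric objects, not strings. A float boxes a C double, and an int is arbitrary-precision. A quick sanity check:

```python
import sys

# In CPython, numbers are real numeric objects, not strings:
# a float wraps a C double, and an int is arbitrary-precision.
x = 1.2
n = 10 ** 100                      # never overflows

print(type(x).__name__)            # float
print(n.bit_length())              # 333
print(sys.float_info.mant_dig)     # 53-bit mantissa on IEEE-754 platforms
```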

NumPy

How much of a performance difference can NumPy make? This application makes heavy use of vectors and related mathematics. How much of a difference can be made by using it to accelerate these operations?
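As a rough illustration of what's at stake (a hypothetical sketch, not the asker's actual code): with orthographic rays cast straight down -z, the per-pixel sphere test reduces to evaluating one quadratic discriminant, and NumPy can evaluate it for all 250,000 pixels in a handful of array operations instead of 250,000 interpreted loop iterations.

```python
import numpy as np

# Hypothetical setup matching the loop described in the question:
# orthographic rays cast straight down -z from a 500x500 pixel grid.
W = H = 500
xs, ys = np.meshgrid(np.arange(W) - W / 2.0, np.arange(H) - H / 2.0)

# Sphere at (0, 0, -50), radius 40.  With direction (0, 0, -1) the
# ray-sphere quadratic's coefficients reduce to scalars plus one array:
oz = 50.0                            # origin-minus-center z, same for all rays
b = -2.0 * oz
c = xs * xs + ys * ys + oz * oz - 40.0 ** 2
disc = b * b - 4.0 * c               # a == 1 for a unit direction

# One vectorised hit test covers all 250,000 pixels at once.
image = np.where(disc >= 0.0, 255, 0)
```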

Anything else interesting

If you can think of anything else worth knowing, feel free to mention it.

Some background...

Since there are a few people bringing in the 'look at your algorithms first' advice (which is quite sensible advice, but doesn't really help with my purpose in asking this question) I'll add a bit here about what's going on, and why I'm asking about this.

The pet project in question is a ray-tracer written in Python. It's not yet very far along and currently just hit tests against two objects (a triangle and a sphere) within the scene. No shading, shadowing or lighting calculations are being performed. The algorithm is basically:

for each x, y position in the image:
     create a ray
     hit test vs. sphere
     hit test vs. triangle
     colour the pixel based on the closest object, or black if no hit.
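In runnable form, the loop above might look something like the following minimal sketch (the names and the orthographic camera here are hypothetical, not the asker's actual code; only the ray-sphere test is shown, and the triangle test is elided):

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Return the ray parameter t of the nearest hit in front of the
    origin, or None on a miss (standard quadratic ray-sphere test)."""
    ox = origin[0] - center[0]
    oy = origin[1] - center[1]
    oz = origin[2] - center[2]
    dx, dy, dz = direction
    a = dx * dx + dy * dy + dz * dz
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                      # ray misses the sphere
    t = (-b - math.sqrt(disc)) / (2.0 * a)
    return t if t > 0.0 else None        # hit must be in front of the origin

WIDTH = HEIGHT = 500

def render(center, radius):
    """Orthographic casting: one ray straight down -z per pixel;
    white where the sphere is hit, black elsewhere."""
    image = []
    for y in range(HEIGHT):
        row = []
        for x in range(WIDTH):
            origin = (x - WIDTH / 2.0, y - HEIGHT / 2.0, 0.0)
            t = hit_sphere(origin, (0.0, 0.0, -1.0), center, radius)
            row.append(255 if t is not None else 0)
        image.append(row)
    return image
```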

Algorithmic refinements in ray-tracing generally work by eliminating objects in the scene early. They can provide a considerable boost for complex scenes, but if this ray-tracer can't hit test against a mere two objects without struggling, then it's not going to be able to handle very much at all.

While I realise that a Python based ray-tracer won't quite be able to reach the performance of a C based one, given that real-time ray tracers like Arauna can manage 15-20 FPS on my computer rendering reasonably complex scenes at 640x480, I'd expect rendering a very basic 500x500 image in Python to be doable in under a second.

Currently, my code is taking 38 seconds. It seems to me like it really shouldn't take that long.

Profiling shows the bulk of the time being spent in the actual hit testing routines for these shapes. This isn't particularly surprising in a ray tracer, and what I expected. The call-counts for these hit tests are each 250,000 (500x500 exactly), which would indicate they're being called exactly as often as they should be. This is a pretty textbook case of the 3% where optimization is advisable.

I'm planning on doing the full timing / measuring thing as I work on improving the code. However, without some basic knowledge of what costs what in Python my attempts to tune my code would be little more than stumbling in the dark. I figured it would serve me well to gain a little knowledge to light the way.

Solution

Python's compiler is deliberately dirt-simple -- this makes it fast and highly predictable. Apart from some constant folding, it basically generates bytecode that faithfully mimics your sources. Somebody else already suggested dis, and it's indeed a good way to look at the bytecode you're getting -- for example, how for i in [1, 2, 3]: isn't actually doing constant folding but generating the literal list on the fly, while for i in (1, 2, 3): (looping on a literal tuple instead of a literal list) is able to constant-fold (reason: a list is a mutable object, and to keep to the "dirt-simple" mission statement the compiler doesn't bother to check that this specific list is never modified so it could be optimized into a tuple).

So there's space for ample manual micro-optimization -- hoisting, in particular. I.e., rewrite

for x in whatever():
    anobj.amethod(x)

as

f = anobj.amethod
for x in whatever():
    f(x)

to save the repeated lookups (the compiler doesn't check whether a run of anobj.amethod can actually change anobj's bindings &c so that a fresh lookup is needed next time -- it just does the dirt-simple thing, i.e., no hoisting, which guarantees correctness but definitely doesn't guarantee blazing speed;-).
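The same measurement can also be scripted from inside Python with the timeit module (a sketch; absolute numbers are machine-dependent, and on modern CPython the attribute-lookup savings may be smaller than they were back then):

```python
import timeit

setup = "s = 'hello world'"

# Attribute looked up on every iteration vs. hoisted once before the loop.
unhoisted = timeit.timeit("for _ in range(100): s.upper()",
                          setup=setup, number=10_000)
hoisted = timeit.timeit("up = s.upper\nfor _ in range(100): up()",
                        setup=setup, number=10_000)

print(f"unhoisted: {unhoisted:.4f}s  hoisted: {hoisted:.4f}s")
```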

The timeit module (best used at a shell prompt IMHO) makes it very simple to measure the overall effects of compilation + bytecode interpretation (just ensure the snippet you're measuring has no side effects that would affect the timing, since timeit does run it over and over in a loop;-). For example:

$ python -mtimeit 'for x in (1, 2, 3): pass'
1000000 loops, best of 3: 0.219 usec per loop
$ python -mtimeit 'for x in [1, 2, 3]: pass'
1000000 loops, best of 3: 0.512 usec per loop

you can see the costs of the repeated list construction -- and confirm that is indeed what we're observing by trying a minor tweak:

$ python -mtimeit -s'Xs=[1,2,3]' 'for x in Xs: pass'
1000000 loops, best of 3: 0.236 usec per loop
$ python -mtimeit -s'Xs=(1,2,3)' 'for x in Xs: pass'
1000000 loops, best of 3: 0.213 usec per loop

moving the iterable's construction to the -s setup (which is run only once and not timed) shows that the looping proper is slightly faster on tuples (maybe 10%), but the big issue with the first pair (list slower than tuple by over 100%) is mostly with the construction.

Armed with timeit and the knowledge that the compiler's deliberately very simple minded in its optimizations, we can easily answer other questions of yours:

How fast are the following operations (comparatively)

* Function calls
* Class instantiation
* Arithmetic
* 'Heavier' math operations such as sqrt()

$ python -mtimeit -s'def f(): pass' 'f()'
10000000 loops, best of 3: 0.192 usec per loop
$ python -mtimeit -s'class o: pass' 'o()'
1000000 loops, best of 3: 0.315 usec per loop
$ python -mtimeit -s'class n(object): pass' 'n()'
10000000 loops, best of 3: 0.18 usec per loop

so we see: instantiating a new-style class and calling a function (both empty) are about the same speed, with instantiations possibly having a tiny speed margin, maybe 5%; instantiating an old-style class is slowest (by about 50%). Tiny differences such as 5% or less of course could be noise, so repeating each try a few times is advisable; but differences like 50% are definitely well beyond noise.

$ python -mtimeit -s'from math import sqrt' 'sqrt(1.2)'
1000000 loops, best of 3: 0.22 usec per loop
$ python -mtimeit '1.2**0.5'
10000000 loops, best of 3: 0.0363 usec per loop
$ python -mtimeit '1.2*0.5'
10000000 loops, best of 3: 0.0407 usec per loop

and here we see: calling sqrt is slower than doing the same computation by operator (using the ** raise-to-power operator) by roughly the cost of calling an empty function; all arithmetic operators are roughly the same speed to within noise (the tiny difference of 3 or 4 nanoseconds is definitely noise;-). Checking whether constant folding might interfere:
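Note that the two spellings agree numerically (up to floating-point rounding), so the gap really is call overhead plus constant folding; a quick check of both claims:

```python
import dis
import math

# sqrt() and ** 0.5 compute the same value up to rounding.
print(math.isclose(math.sqrt(1.2), 1.2 ** 0.5, rel_tol=1e-12))  # True

# And ** on two literals is folded at compile time: no power is actually
# computed at run time, which flatters its timing in the comparison above.
ops = [ins.opname for ins in dis.Bytecode(compile("1.2 ** 0.5", "<s>", "eval"))]
print("BINARY_POWER" in ops or "BINARY_OP" in ops)   # False: folded to a constant
```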

$ python -mtimeit '1.2*0.5'
10000000 loops, best of 3: 0.0407 usec per loop
$ python -mtimeit -s'a=1.2; b=0.5' 'a*b'
10000000 loops, best of 3: 0.0965 usec per loop
$ python -mtimeit -s'a=1.2; b=0.5' 'a*0.5'
10000000 loops, best of 3: 0.0957 usec per loop
$ python -mtimeit -s'a=1.2; b=0.5' '1.2*b'
10000000 loops, best of 3: 0.0932 usec per loop

...we see that this is indeed the case: if either or both numbers are being looked up as variables (which blocks constant folding), we're paying the "realistic" cost. Variable lookup has its own cost:

$ python -mtimeit -s'a=1.2; b=0.5' 'a'
10000000 loops, best of 3: 0.039 usec per loop

and that's far from negligible when we're trying to measure such tiny times anyway. Indeed constant lookup isn't free either:

$ python -mtimeit -s'a=1.2; b=0.5' '1.2'
10000000 loops, best of 3: 0.0225 usec per loop

as you see, while smaller than variable lookup it's quite comparable -- about half.

If and when (armed with careful profiling and measurement) you decide some nucleus of your computations desperately needs optimization, I recommend trying Cython -- it's a C / Python merge which tries to be as neat as Python and as fast as C, and while it can't get there 100% it surely makes a good fist of it (in particular, it makes binary code that's quite a bit faster than you can get with its predecessor language, Pyrex, as well as being a bit richer). For the last few %'s of performance you probably still want to go down to C (or assembly / machine code in some exceptional cases), but that would be really, really rare.
