Why is code using intermediate variables faster than code without?


Question

I have encountered this weird behavior and failed to explain it. These are the benchmarks:

py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 97.7 usec per loop
py -3 -m timeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
10000 loops, best of 3: 70.7 usec per loop

How come the comparison with variable assignment is faster than the one-liner with temporary variables by more than 27%?

By the Python docs, garbage collection is disabled during timeit so it can't be that. Is it some sort of an optimization?

The results can also be reproduced in Python 2.x, though to a lesser extent.

Running Windows 7, CPython 3.5.1, Intel i7 3.40 GHz, 64-bit both OS and Python. A different machine I tried, an Intel i7 3.60 GHz with Python 3.5.0, does not seem to reproduce the results.

Running in the same Python process with timeit.timeit() at 10000 loops produced 0.703 and 0.804 respectively. The effect still shows, although to a lesser extent (~12.5%).
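Such an in-process measurement can be reproduced with a sketch like the one below; the absolute numbers depend heavily on the machine and Python build, so no particular ratio is asserted here:

```python
import timeit

# Time both variants in the same process, 10000 loops each,
# mirroring the command-line benchmarks above.
one_liner = timeit.timeit(
    "tuple(range(2000)) == tuple(range(2000))", number=10000)
with_vars = timeit.timeit(
    "a = tuple(range(2000)); b = tuple(range(2000)); a == b", number=10000)
print(one_liner, with_vars)
```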

Answer

My results were similar to yours: the code using intermediate variables was pretty consistently at least 10-20% faster in Python 3.4. However, when I used IPython on the very same Python 3.4 interpreter, I got these results:

In [1]: %timeit -n10000 -r20 tuple(range(2000)) == tuple(range(2000))
10000 loops, best of 20: 74.2 µs per loop

In [2]: %timeit -n10000 -r20 a = tuple(range(2000));  b = tuple(range(2000)); a==b
10000 loops, best of 20: 75.7 µs per loop

Notably, I never managed to get even close to the 74.2 µs for the former when I used -mtimeit from the command line.

So this Heisenbug turned out to be something quite interesting. I decided to run the command with strace and indeed there is something fishy going on:

% strace -o withoutvars python3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 134 usec per loop
% strace -o withvars python3 -mtimeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
10000 loops, best of 3: 75.8 usec per loop
% grep mmap withvars|wc -l
46
% grep mmap withoutvars|wc -l
41149

Now that is a good reason for the difference. The code that does not use variables causes the mmap system call to be made almost 1000x more often than the code that uses intermediate variables.

The withoutvars trace is full of mmap/munmap pairs for a 256k region; these same lines are repeated over and over again:

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0

The mmap call seems to be coming from the function _PyObject_ArenaMmap in Objects/obmalloc.c; obmalloc.c also contains the macro ARENA_SIZE, which is #defined to be (256 << 10) (that is, 262144); similarly, the munmap matches _PyObject_ArenaMunmap from obmalloc.c.
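As a quick arithmetic check, the macro's value matches the length seen in every mmap/munmap pair of the trace:

```python
# ARENA_SIZE in Objects/obmalloc.c is #defined as (256 << 10);
# this is exactly the 262144-byte length in the traced mmap calls.
ARENA_SIZE = 256 << 10
print(ARENA_SIZE)  # 262144
```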

obmalloc.c states:

Prior to Python 2.5, arenas were never free()'ed. Starting with Python 2.5, we do try to free() arenas, and use some mild heuristic strategies to increase the likelihood that arenas eventually can be freed.

Thus these heuristics, and the fact that the Python object allocator releases these free arenas as soon as they're emptied, lead to python3 -mtimeit 'tuple(range(2000)) == tuple(range(2000))' triggering pathological behaviour where one 256 kiB memory area is re-allocated and released repeatedly; and this allocation happens with mmap/munmap, which are comparatively costly as they're system calls - furthermore, mmap with MAP_ANONYMOUS requires that the newly mapped pages be zeroed, even though Python wouldn't care.

This behaviour is not present in the code that uses intermediate variables, because it uses slightly more memory, and no memory arena can be freed as some objects are still allocated in it. That is because timeit will make it into a loop not unlike:

for n in range(10000):
    a = tuple(range(2000))
    b = tuple(range(2000))
    a == b

Now the behaviour is that both a and b will stay bound until they're reassigned, so in the second iteration, tuple(range(2000)) will allocate a 3rd tuple, and the assignment a = tuple(...) will decrease the reference count of the old tuple, causing it to be released, and increase the reference count of the new tuple; then the same happens to b. Therefore after the first iteration there are always at least 2 of these tuples alive, if not 3, so the thrashing doesn't occur.

Most notably, it cannot be guaranteed that code using intermediate variables is always faster - indeed in some setups using intermediate variables might result in extra mmap calls, whereas the code that compares return values directly might be fine.

Someone asked why this happens when timeit disables garbage collection. It is indeed true that timeit does so:


Note

By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string.
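The docs' example re-enables GC via the setup string, along these lines (run here with a small, explicit number of loops):

```python
import timeit

# 'gc.enable()' as the setup statement turns the collector back on
# for the timed statement; gc is available in timeit's default globals.
t = timeit.Timer('for i in range(10): oct(i)', 'gc.enable()').timeit(number=1000)
print(t)
```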

However, the garbage collector of Python is only there to reclaim cyclic garbage, i.e. collections of objects whose references form cycles. That is not the case here; instead, these objects are freed immediately when their reference count drops to zero.
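The distinction can be sketched like this (CPython refcounting semantics assumed): only a reference cycle needs the collector, while acyclic objects die the moment their last reference goes away.

```python
import gc
import weakref

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a   # a reference cycle: refcounts never reach zero
r = weakref.ref(a)
del a, b
assert r() is not None    # refcounting alone cannot free the cycle
gc.collect()              # the cyclic collector can
assert r() is None
```

The tuples in the benchmark form no such cycle, so disabling GC during timeit changes nothing about when they are freed.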
