Is there a way to circumvent Python list.append() becoming progressively slower in a loop as the list grows?

Problem description

I have a big file that I'm reading from, converting every few lines into an instance of an object.

As I loop through the file, I stash each instance in a list using list.append(instance), and then continue looping.

The file is around 100 MB, so it isn't too large, but as the list grows, the looping slows down progressively. (I print the time for each lap of the loop.)

This is not intrinsic to the loop: when I print every new instance as I loop through the file, the program progresses at constant speed; it is only when I append them to a list that it gets slow.

My friend suggested disabling garbage collection before the loop, then re-enabling it afterward and making a garbage-collection call.

Has anyone else observed a similar problem with list.append getting slower? Is there any other way to circumvent this?

I'll try the two things suggested below.

(1) Pre-allocating the memory: what's the best way to do this? (2) Trying collections.deque. Both ideas are sketched below.

Multiple posts (see the comment by Alex Martelli) suggested memory fragmentation (he has a large amount of available memory, as I do), but there were no obvious fixes to performance for this.

To replicate the phenomenon, run the test code provided in the answer below and assume that the lists contain useful data.

gc.disable() and gc.enable() help with the timing. I'll also do a careful analysis of where all the time is spent.
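For concreteness, the pattern looks like this minimal sketch, where parse is a hypothetical callable that turns raw input into one object; the try/finally ensures GC is restored even if parsing fails:

import gc

def load_instances(f, parse):
    # parse is a hypothetical callable: one raw line -> one object.
    instances = []
    gc.disable()              # suspend cyclic GC while the list grows
    try:
        for line in f:
            instances.append(parse(line))
    finally:
        gc.enable()           # always restore GC, even on error
        gc.collect()          # one full pass once the list is built
    return instances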

Answer

The poor performance you observe is caused by a bug in the Python garbage collector. To resolve this issue, disable garbage collection while you build the list and turn it back on when you finish. You will find that performance approximates the amortized O(1) behavior expected of list appending in Python.

(You can also tweak the garbage collector's triggers or selectively call collect as you progress, but I do not explore these options in this answer because they are more complex; I suspect your use case is amenable to the solution above.)
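For completeness, a sketch of the threshold-tweaking variant; the values here are illustrative, not tuned:

import gc

# Raise the generation-0 trigger so collections fire far less often.
# CPython's default is (700, 10, 10); 100000 is an illustrative value.
old = gc.get_threshold()
gc.set_threshold(100000, old[1], old[2])

# ... build the list here ...

gc.set_threshold(*old)   # restore the defaults
gc.collect()             # and do one full collection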

Background:

See http://bugs.python.org/issue4074 and also http://docs.python.org/release/2.5.2/lib/module-gc.html

The reporter observes that appending complex objects (objects that aren't numbers or strings) to a list slows linearly as the list grows in length.

The reason for this behavior is that the garbage collector checks and rechecks every object in the list to see whether it is eligible for garbage collection. This behavior causes the linear increase in the time needed to add objects to a list. A fix is expected to land in py3k, so it should not apply to the interpreter you are using.
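You can see this effect directly by timing a full collection at increasing heap sizes. In a sketch like the one below (Node is a throwaway class), gc.collect() takes longer the more tracked container objects are alive, even though nothing is actually garbage:

import gc
import time

class Node(object):
    def __init__(self):
        self.payload = 'x'

# Time a full collection as the number of live, tracked objects grows.
for n in (100000, 500000, 1000000):
    objs = [Node() for _ in xrange(n)]
    t0 = time.time()
    gc.collect()
    print n, time.time() - t0   # collection time grows with n
    del objs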

Tests:

I ran a test to demonstrate this. For 1,000 iterations I append 10,000 objects to a list and record the runtime for each iteration. The overall runtime difference is immediately obvious. With garbage collection disabled during the inner loop of the test, the runtime on my system is 18.6 s. With garbage collection enabled for the entire test, the runtime is 899.4 s.

Here is the test:

import time
import gc

class A:
    def __init__(self):
        self.x = 1
        self.y = 2
        self.why = 'no reason'

def time_to_append(size, append_list, item_gen):
    # Append `size` freshly created objects and return the elapsed wall time.
    t0 = time.time()
    for i in xrange(0, size):
        append_list.append(item_gen())
    return time.time() - t0

def test():
    # Baseline: GC stays enabled for the entire run.
    x = []
    count = 10000
    for i in xrange(0, 1000):
        print len(x), time_to_append(count, x, lambda: A())

def test_nogc():
    # Same workload, but GC is suspended around the timed appends.
    x = []
    count = 10000
    for i in xrange(0, 1000):
        gc.disable()
        print len(x), time_to_append(count, x, lambda: A())
        gc.enable()

Full source: http://hypervolu.me/~erik/programming/python_lists/listtest.py.txt

Graphical result: red is with GC on, blue is with GC off. The y-axis is seconds, scaled logarithmically.

As the two plots differ by several orders of magnitude in the y component, here they are shown independently with the y-axis scaled linearly.

Interestingly, with garbage collection off, we see only small spikes in runtime per 10k appends, which suggests that Python's list reallocation costs are relatively low. In any case, they are many orders of magnitude lower than the garbage collection costs.
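That over-allocation is easy to observe with sys.getsizeof: the list's allocated size jumps only occasionally, so most appends never touch the allocator. A small sketch:

import sys

lst = []
last = sys.getsizeof(lst)
for i in xrange(100):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last:
        print i, size    # prints only at the reallocation points
        last = size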

The density of the above plots makes it difficult to see that, with the garbage collector on, most intervals actually have good performance; it's only when the garbage collector cycles that we encounter the pathological behavior. You can observe this in the histogram of 10k append times: most of the data points fall around 0.02 s per 10k appends.

The raw data used to produce these plots can be found at http://hypervolu.me/~erik/programming/python_lists/
