Is there a way to circumvent Python list.append() becoming progressively slower in a loop as the list grows?


Question


I have a big file that I'm reading from, converting every few lines to an instance of an object.

Since I'm looping through the file, I stash the instance to a list using list.append(instance), and then continue looping.

The file is around 100 MB, so it isn't too large, but as the list grows, the looping slows down progressively. (I print the time for each lap of the loop.)

This is not intrinsic to the loop ~ when I print every new instance as I loop through the file, the program progresses at constant speed ~ it is only when I append them to a list that it slows down.

My friend suggested disabling garbage collection before the while loop and enabling it afterward & making a garbage collection call.

Did anyone else observe a similar problem with list.append getting slower? Is there any other way to circumvent this?


I'll try the two things suggested below.

(1) "pre-allocating" the memory ~ what's the best way to do this? (2) Try using deque

Multiple posts (see the comment by Alex Martelli) suggested memory fragmentation (he has a large amount of available memory, as I do) ~ but no obvious performance fixes came of that.

To replicate the phenomenon, please run the test code provided below in the answers and assume that the lists have useful data.


gc.disable() and gc.enable() help with the timing. I'll also do a careful analysis of where all the time is spent.

Solution

The poor performance you observe is caused by a bug in the Python garbage collector in the version you are using. Upgrade to Python 2.7, or to 3.1 or above, to regain the amortized O(1) behavior expected of list appending in Python.

If you cannot upgrade, disable garbage collection as you build the list and turn it on after you finish.
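
In outline, that looks like the following (a minimal sketch; the file name and make_instance parser are hypothetical stand-ins for your actual code):

import gc

def make_instance(line):
    # Stand-in for the real "every few lines to an instance" parsing.
    return line.split()

instances = []
gc.disable()  # suspend collection while the list grows
try:
    for line in open('data.txt'):  # 'data.txt' is a placeholder name
        instances.append(make_instance(line))
finally:
    gc.enable()   # always re-enable, even if parsing raises
    gc.collect()  # one full collection once the list is built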

(You can also tweak the garbage collector's triggers or selectively call collect as you progress, but I do not explore these options in this answer because they are more complex and I suspect your use case is amenable to the above solution.)
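
For readers who do want those knobs, a minimal sketch (the threshold values here are arbitrary examples, not tuned recommendations):

import gc

# The defaults are (700, 10, 10); raising the first value makes
# generation-0 collections far less frequent.
gc.set_threshold(100000, 10, 10)

# Alternatively, keep the collector off and collect at points you choose:
gc.disable()
# ... append a large batch of objects ...
gc.collect()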

Background:

See: https://bugs.python.org/issue4074 and also https://docs.python.org/release/2.5.2/lib/module-gc.html

The reporter observes that appending complex objects (objects that aren't numbers or strings) to a list slows linearly as the list grows in length.

The reason for this behavior is that the garbage collector checks and rechecks every object in the list to see whether it is eligible for garbage collection; the more objects the list holds, the more work each pass does. This causes the linear increase in the time needed to add objects to a list. A fix is expected to land in py3k, so it should not apply to the interpreter you are using.
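
You can watch the collector's bookkeeping with the gc module (a small side illustration, assuming CPython; Node is a hypothetical stand-in for any gc-tracked object):

import gc

class Node(object):
    # A minimal container object; instances are tracked by the collector.
    def __init__(self):
        self.ref = None

print gc.get_threshold()  # (700, 10, 10) by default
print gc.get_count()      # pending allocations per generation

xs = []
for i in xrange(100000):
    xs.append(Node())
# Each time the generation-0 counter crosses its threshold, a collection
# pass runs over tracked objects ~ including those already in xs, which
# is the re-checking described above.
print gc.get_count()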

Test:

I ran a test to demonstrate this. For 1k iterations I append 10k objects to a list, and record the runtime for each iteration. The overall runtime difference is immediately obvious. With garbage collection disabled during the inner loop of the test, runtime on my system is 18.6s. With garbage collection enabled for the entire test, runtime is 899.4s.

This is the test:

import time
import gc

# Python 2 syntax (xrange, print statement), matching the interpreters
# that exhibit the bug.

class A:
    # A small object with a few attributes; instances are tracked
    # by the garbage collector.
    def __init__(self):
        self.x = 1
        self.y = 2
        self.why = 'no reason'

def time_to_append(size, append_list, item_gen):
    # Time `size` appends of freshly created objects onto append_list.
    t0 = time.time()
    for i in xrange(0, size):
        append_list.append(item_gen())
    return time.time() - t0

def test():
    # Collector enabled throughout: each batch slows as the list grows.
    x = []
    count = 10000
    for i in xrange(0, 1000):
        print len(x), time_to_append(count, x, lambda: A())

def test_nogc():
    # Collector disabled during each batch of appends.
    x = []
    count = 10000
    for i in xrange(0, 1000):
        gc.disable()
        print len(x), time_to_append(count, x, lambda: A())
        gc.enable()
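
The snippet defines the two tests but never calls them; a minimal driver (assuming Python 2, to match the xrange/print syntax above) would be:

if __name__ == '__main__':
    test_nogc()  # or test(), to time with the collector left on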

Full source: https://hypervolu.me/~erik/programming/python_lists/listtest.py.txt

Graphical result: red is with gc on, blue is with gc off. The y-axis shows seconds, scaled logarithmically.


[log-scale plot omitted; source: hypervolu.me]

Because the two plots differ by several orders of magnitude in the y component, here they are shown independently, with the y-axis scaled linearly.


[two linear-scale plots omitted, one per curve; source: hypervolu.me]

Interestingly, with garbage collection off, we see only small spikes in runtime per 10k appends, which suggests that Python's list reallocation costs are relatively low. In any case, they are many orders of magnitude lower than the garbage collection costs.
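
This matches how CPython lists grow: they over-allocate, so most appends reuse spare capacity and only the occasional append triggers a reallocation. A quick side illustration (not part of the original test; sys.getsizeof requires Python 2.6+):

import sys

xs = []
last = 0
for i in xrange(2000):
    xs.append(None)
    size = sys.getsizeof(xs)
    if size != last:
        # The allocated size jumps only at widening intervals.
        print len(xs), size
        last = size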

The density of the above plots makes it difficult to see that, with the garbage collector on, most intervals actually have good performance; it's only when the garbage collector cycles that we encounter the pathological behavior. You can observe this in the histogram of 10k-append times below. Most of the datapoints fall around 0.02s per 10k appends.


[histogram omitted; source: hypervolu.me]

The raw data used to produce these plots can be found at http://hypervolu.me/~erik/programming/python_lists/

