Python: Memory usage and optimization when modifying lists


Problem description

My concern is the following: I am storing a relatively large dataset in a classical Python list, and in order to process the data I must iterate over the list several times, perform some operations on the elements, and often pop an item out of the list.

It seems that deleting one item out of a Python list costs O(N), since Python has to copy all the items above the element at hand down one place. Furthermore, since the number of items to delete is approximately proportional to the number of elements in the list, this results in an O(N^2) algorithm.
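As a quick sanity check of that O(N) claim, one can time deletions from the front of a list against deletions from the back with timeit (a minimal sketch; the sizes are arbitrary and not from the original question):

import timeit

# pop(0) forces CPython to shift every remaining pointer down one
# slot, while pop() from the end is O(1).
front = timeit.timeit('lst.pop(0)', setup='lst = list(range(100000))', number=10000)
back = timeit.timeit('lst.pop()', setup='lst = list(range(100000))', number=10000)
print('pop(0): %.4fs   pop(): %.4fs' % (front, back))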

I am hoping to find a solution that is cost effective (time and memory-wise). I have studied what I could find on the internet and have summarized my different options below. Which one is the best candidate?

while processingdata:
    index = 0
    while index < len(somelist):
        item = somelist[index]
        dosomestuff(item)
        if somecondition(item):
            del somelist[index]
        else:
            index += 1

This is the original solution I came up with. Not only is this not very elegant, but I am hoping there is a better way to do it that remains time and memory efficient.

while processingdata:
    for i in xrange(len(somelist) - 1, -1, -1):
        item = somelist[i]
        dosomestuff(item)
        if somecondition(somelist, i):
            somelist.pop(i)

This avoids incrementing an index variable but ultimately has the same cost as the original version. It also breaks the logic of dosomestuff(item), which wishes to process the items in the same order as they appear in the original list.

import gc

while processingdata:
    for i, item in enumerate(somelist):
        dosomestuff(item)
    newlist = []
    for item in somelist:
        if not somecondition(item):
            newlist.append(item)
    somelist = newlist
    gc.collect()

This is a very naive strategy for eliminating elements from a list; it requires lots of memory, since an almost-full copy of the list must be made.

while processingdata:
    for i, item in enumerate(somelist):
        dosomestuff(item)
    somelist[:] = [x for x in somelist if not somecondition(x)]

This is very elegant, but under the covers it walks the whole list one more time and must copy most of the elements in it. My intuition is that this operation probably costs more than the original del statement, at least memory-wise. Keep in mind that somelist can be huge, and that any solution that iterates through it only once per run will probably always win.
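One detail worth noting about this variant: the slice assignment mutates the existing list object in place rather than rebinding the name, which matters if other references to somelist exist. A minimal sketch with illustrative values (not from the original question):

somelist = [1, 2, 3, 4, 5, 6]
alias = somelist  # a second reference to the same list object

# The comprehension still builds a temporary list, but the slice
# assignment copies its contents back into the original object.
somelist[:] = [x for x in somelist if x % 2 == 0]

print(alias)              # [2, 4, 6] -- the alias sees the change
print(alias is somelist)  # True -- still the same object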

while processingdata:
    for i, item in enumerate(somelist):
        dosomestuff(item)
    somelist = filter(lambda x: not somecondition(x), somelist)

This also creates a new list, occupying lots of RAM.

from itertools import ifilterfalse
while processingdata:
    for item in ifilterfalse(somecondition, somelist):
        dosomestuff(item)

This version of the filter call does not create a new list, but it will not call dosomestuff on every item, breaking the logic of the algorithm. I am including this example only for the purpose of creating an exhaustive list.

while processingdata:
    index = 0
    for item in somelist:
        dosomestuff(item)
        if not somecondition(item):
            somelist[index] = item
            index += 1
    del somelist[index:]

This is a subtle method that seems cost effective. I think it will move each item (or the pointer to each item?) exactly once, resulting in an O(N) algorithm. Finally, I hope Python will be intelligent enough to resize the list at the end without allocating memory for a new copy of the list. Not sure though.
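Whether CPython actually gives memory back when the list is truncated can be probed with sys.getsizeof, which reports the list object's currently allocated size (a rough check on one implementation, not a guarantee):

import sys

somelist = list(range(1000000))
before = sys.getsizeof(somelist)
del somelist[1000:]  # truncate, as in the compaction approach above
after = sys.getsizeof(somelist)
print('before: %d bytes, after: %d bytes' % (before, after))
# In CPython the list's internal buffer is reallocated to a smaller
# size once the length drops far enough, so 'after' is much smaller.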

class Doubly_Linked_List:
    def __init__(self):
        self.first = None
        self.last = None
        self.n = 0
    def __len__(self):
        return self.n
    def __iter__(self):
        return DLLIter(self)
    def iterator(self):
        return self.__iter__()
    def append(self, x):
        x = DLLElement(x)
        x.next = None
        if self.last is None:
            x.prev = None
            self.last = x
            self.first = x
            self.n = 1
        else:
            x.prev = self.last
            x.prev.next = x
            self.last = x
            self.n += 1

class DLLElement:
    def __init__(self, x):
        self.next = None
        self.data = x
        self.prev = None

class DLLIter:
    etc...
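The iterator and the O(1) removal are elided above; a minimal sketch of what they could look like, assuming the iterator walks from first to last and that removal is handed the DLLElement node itself (both completions are assumptions, not part of the original question):

class DLLIter:
    # Walks the chain from first to last, yielding the stored data.
    def __init__(self, dll):
        self.current = dll.first

    def __iter__(self):
        return self

    def next(self):
        if self.current is None:
            raise StopIteration
        data = self.current.data
        self.current = self.current.next
        return data

    __next__ = next  # Python 3 compatibility

def dll_remove(dll, node):
    # Unlink a DLLElement in O(1): only the neighbours' pointers and
    # the list endpoints are touched, no matter how long the list is.
    if node.prev is None:
        dll.first = node.next
    else:
        node.prev.next = node.next
    if node.next is None:
        dll.last = node.prev
    else:
        node.next.prev = node.prev
    dll.n -= 1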

This type of object resembles a Python list in a limited way. However, deletion of an element is guaranteed O(1). I would not like to go here, since this would require massive amounts of code refactoring almost everywhere.

Recommended answer

Without knowing the specifics of what you're doing with this list, it's hard to know exactly what would be best in this case. If your processing stage depends on the current index of the list element, this won't work, but if not, it appears you've left off the most Pythonic (and in many ways, easiest) approach: generators.

If all you're doing is iterating over each element, processing it in some way, then either including that element in the list or not, use a generator. Then you never need to store the entire iterable in memory.

def process_and_generate_data(source_iterable):
    for item in source_iterable:
        dosomestuff(item)
        if not somecondition(item):
            yield item

You would need to have a processing loop that dealt with persisting the processed iterable (writing it back to a file, or whatever). Alternatively, if you have multiple processing stages that you'd prefer to separate into different generators, you could have your processing loop pass one generator to the next.
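A minimal sketch of chaining stage generators and persisting the result; stage_two, transform, and the output file name are illustrative assumptions, not part of the answer:

def stage_one(items):
    for item in items:
        dosomestuff(item)
        if not somecondition(item):
            yield item

def stage_two(items):
    for item in items:
        yield transform(item)  # a hypothetical second processing stage

# The pipeline is lazy: no intermediate list is ever materialized.
pipeline = stage_two(stage_one(somelist))
with open('processed.txt', 'w') as out:
    for item in pipeline:
        out.write('%s\n' % item)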
